Abstract. Unaggregated data, in a streamed or distributed form, is prevalent and comes from diverse sources such as interactions of
users with web services and IP traffic. Data elements have keys (cookies, users, queries) and elements with different keys interleave.
Analytics on such data typically utilizes statistics expressed as a sum over keys in a specified segment of a function $f$ applied to
the frequency (the total number of occurrences) of the key. In particular, Distinct is the number of active keys in the segment, Sum
is the sum of their frequencies, and both are special cases of frequency cap statistics, which cap the frequency by a parameter $T$.
An important application of cap statistics is staging advertisement campaigns, where the cap parameter is the maximum number of
impressions per user and the statistic is the total number of qualifying impressions.

The number of distinct active keys in the data can be very large, making exact computation of queries costly. Instead, we can estimate
these statistics from a sample. An optimal sample for a given function $f$ would include a key with frequency $w$ with probability roughly
proportional to $f(w)$. But while such a "gold-standard" sample can be easily computed over the aggregated data (the set of
key-frequency pairs), exact aggregation itself is costly, requiring state proportional to the number of active keys. Ideally, we would like to
compute a sample without exact aggregation.

We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data)
and state proportional to the desired sample size. Our design unifies classic solutions for Distinct and Sum. Specifically, our $\ell$-capped
samples provide nonnegative unbiased estimates of any monotone non-decreasing frequency statistics, and close to gold-standard
estimates for frequency cap statistics with $T = \Theta(\ell)$. Furthermore, our design facilitates multi-objective samples, which provide tight
estimates for a specified set of statistics using a single smaller sample.

1 Introduction

The data available from many services, such as interactions 
of users with Web services or content, search logs, and IP 
traffic, is presented in an unaggregated form. In this model, 
each data element has a key from a universe X and a weight 
w > 0. Data elements with different keys interleave in a data 
stream or distributed storage. 

The aggregated view of the data is a set of pairs, each consisting
of an active key $x \in X$ (a key that occurred at least once)
and the respective total weight $w_x$ of all elements with key
$x$. When all element weights are uniform, $w_x$ is the number
of occurrences of an element with key $x$ in the data. The
weight $w_x$ is often referred to as the frequency of the key (it
is proportional to the actual frequency in the data set).

Frequency statistics of such data are fundamental to data
analytics. Queries have the form

$$Q(f,H) = \sum_{x \in X \cap H} f(w_x) , \qquad (1)$$

where $f(w) \geq 0$ is a nonnegative function such that $f(0) = 0$
and $H$ is a selection predicate that specifies a segment of the
key population $X$. Typically $f$ is monotone non-decreasing,
which means that more frequent keys carry at least the same
contribution as less frequent ones. Some prominent examples
are the frequency moments, where $f(x) = x^p$ for $p \geq 0$ [?],
and frequency cap statistics, where $f$ is a cap function with
parameter $T > 0$:

$$\mathrm{cap}_T(y) = \min\{y, T\} .$$


Two special cases of both cap statistics and frequency
moments that are widely studied and applied in big data
analytics are Distinct, the number of distinct (active) keys in
the segment ($L_0$ moment or $\mathrm{cap}_1$, assuming element weights
are $\geq 1$), and Sum, the sum of weights of elements with keys
in the segment ($L_1$ moment or $\mathrm{cap}_\infty$).
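
The definitions above can be made concrete with a small sketch that computes $Q(f,H)$ exactly from the aggregated view. The helper names (`aggregate`, `cap`, `query`) and the toy data are assumptions made for illustration only; this is the costly exact computation that the sampling schemes are designed to avoid.

```python
from collections import Counter
import math


def aggregate(elements):
    """Aggregated view: total weight w_x of every active key x."""
    totals = Counter()
    for key, weight in elements:
        totals[key] += weight
    return dict(totals)


def cap(T):
    """Cap function cap_T(y) = min(y, T); T may be math.inf."""
    return lambda y: min(y, T)


def query(freqs, f, segment=lambda key: True):
    """Q(f, H): sum of f(w_x) over active keys x that satisfy the predicate H."""
    return sum(f(w) for key, w in freqs.items() if segment(key))


# Toy unaggregated data with uniform weights (hypothetical).
elements = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1), ("b", 1)]
freqs = aggregate(elements)

distinct = query(freqs, cap(1))        # Distinct is the cap_1 statistic
total = query(freqs, cap(math.inf))    # Sum is the cap_infinity statistic
capped = query(freqs, cap(2))          # a mid-range cap
```

Note that `aggregate` keeps one counter per active key, which is exactly the linear state that the rest of the paper tries to avoid.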

Frequency caps that are in the mid-range mitigate the 
domination of the statistics by the (typically few) very frequent 
keys but still provide a larger representation of the more 
frequent keys. Mid-range frequency cap statistics are prevalent 
in online advertising platforms [?], [?]. A common practice
is to allow an advertiser to specify a limit on the number
of impressions of an ad campaign that any individual user
is exposed to in a particular duration of time. Advertisements
also typically target only a segment $H$ of users (say,
certain demographics and geographic location). The statistic
$Q(\mathrm{cap}_T, H)$ is the number of qualifying opportunities for
placing an ad. These queries are posed over past data in order 
to provide an advertiser with a prediction for the total potential 
number of qualifying impressions. Often, the prediction needs 
to be computed or estimated quickly, to facilitate interactive 
campaign planning. 

An exact computation of the frequency statistics (1) requires
aggregating the data by key. The representation size of the
aggregated view, however, and the runtime state needed to
produce it, are linear in the number of distinct keys. Often,
the number of distinct keys is very large, and our system may
be using the same resources to process many different streams
or workloads. To scalably mine such data, our computation
needs to be limited to one or a few passes over the elements while
maintaining a small runtime state (which translates to memory
or communication). A single pass (stream computation) is
necessary when the data is discarded (such as with IP traffic) or
when statistics are collected for live dashboards. Under these
constraints, we often must settle for a small summary of the
data set which can provide approximate answers [?], [?], [?].

A solution which only addresses the final summary size is
to first compute the aggregated view, and then retain a sample
for future queries: For each key we compute a weight equal to
$f(w_x)$, and we then apply a weighted sampling scheme such
as Probability Proportional to Size (pps) [35], VarOpt [?], [?],
or bottom-$k$, which includes successive weighted sampling
without replacement (ppswor) and sequential Poisson/Priority
sampling [33], [?], [?].

From the weighted sample we can compute approximate
segment frequency statistics by applying an appropriate
estimator to the sample. There is a well-understood tradeoff
between the sample size $k$ and the accuracy of our
approximation. For a segment $H$ that has proportion
$q = Q(f,H)/Q(f,X)$ of the statistics value on the general key
population, the coefficient of variation (CV) is (roughly)
$(qk)^{-0.5}$, the inverse of the square root of $qk$. The CV is the
standard error normalized by the mean, and corresponds to the
NRMSE (normalized root mean square error). That is, in order
to obtain an NRMSE of $\epsilon = 0.1$ (10%) on segments that have at
least a $q = 0.001$ fraction of the total value of the statistics,
we need to choose a sample size of $k = \epsilon^{-2}/q = 10^5$,
which is usually much smaller than the number of distinct
keys we might have. This also means that we can obtain
confidence intervals on our estimates using the actual number
of samples from our segment. Moreover, this CV bound is the
best we can hope for (on average over segments) and will be
the gold standard we use in the design of sampling schemes
and estimators in more constrained settings, which preclude
aggregating the data.
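
As a quick sanity check of the arithmetic above, the sketch below evaluates the gold-standard CV bound $(qk)^{-0.5}$ and solves it for $k$. The function names are hypothetical, and the bound is used as an equality purely for sizing purposes.

```python
def cv_bound(q, k):
    """Gold-standard CV bound (q * k) ** -0.5 for a segment with proportion q."""
    return (q * k) ** -0.5


def sample_size_for(q, eps):
    """Sample size k at which the bound (q * k) ** -0.5 equals eps."""
    return 1.0 / (q * eps ** 2)


k = sample_size_for(q=0.001, eps=0.1)   # roughly 1e5, as in the text
```

Halving the target error quadruples the required sample size, while the requirement is only inversely proportional to the segment proportion $q$.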

The challenge is to produce an effective sample of the data,
using one or a few passes while maintaining state that ideally is
of the order of the desired sample size. There is a large body
of work on stream sampling schemes designed for distinct and
sum queries. The Sample and Hold (SH) family of sampling
schemes [20], [?], [7], [19] and another based on VarOpt [?]
are suited for sum queries. Distinct reservoir sampling of keys
[28], [?] is suited for distinct queries. Both SH [19] and distinct
sampling support unbiased estimates of all frequency statistics
and meet our $(qk)^{-0.5}$ CV upper bound target for the particular
statistics they are designed for (the claim for SH is established
here). They do not provide, however, comparable statistical
guarantees for other statistics.

Contributions and Road Map 

Our main contribution is a general sampling framework for
unaggregated streams. The sampling scheme is specified by a
random scoring function that is applied to stream elements.
The scoring function is tailored to the statistics we want to
estimate. Our framework is presented in Section 3, and we
cast the existing distinct and SH sampling schemes as special
cases.

Our framework facilitates the first stream sampling solution
with a CV upper bound that is close to the $(qk)^{-0.5}$ gold
standard for general frequency cap statistics. We offer two
basic designs: a discrete spectrum that only handles uniform
elements and a continuous spectrum which handles arbitrary
positive element weights.

Our discrete spectrum is presented in Section 4. The
sampling algorithms SH$_\ell$ are parametrized by an integer cap
parameter $\ell$. When $\ell$ exceeds the maximum frequency over
keys, SH$_\ell$ is equivalent to classic SH. For $\ell = 1$, it is identical
to distinct sampling. We derive unbiased and admissible
estimators for any discrete frequency statistics, that is, $f$ specified
for nonnegative integers.

Our continuous spectrum SH$_\ell$, for a positive real cap
parameter $\ell$, is presented in a later section. When $\ell \gg \max_x w_x$,
SH$_\ell$ is identical to weighted SH [?]. For $\ell \ll \min_x w_x$,
SH$_\ell$ is distinct sampling. We derive estimators of frequency
statistics where the function $f$ is continuous and differentiable
almost everywhere. Note that most natural statistics, including
frequency moments and cap statistics, can be expressed as
continuous monotone functions, which are differentiable almost
everywhere. Surprisingly perhaps, the continuous spectrum,
which may seem less intuitive than the discrete spectrum,
yields an elegant and simple specification of estimators.

We show that our estimates of $\mathrm{cap}_T$ statistics from SH$_\ell$
samples have CV upper bounded by $O((qk)^{-0.5})$ when
$T = \Theta(\ell)$. The CV bound gracefully degrades with the disparity
$\max\{T/\ell, \ell/T\}$ between the sample cap parameter $\ell$ and the
statistics cap parameter $T$. The estimate of any frequency
function $f$ is unbiased and, for $f$ that is monotone
non-decreasing, also nonnegative. This makes our design very
versatile.

Our estimators are derived by expressing sampling as a
transform from frequencies to expected "sampled frequencies,"
and then inverting the transform. The transform is a
matrix-vector product in the discrete case and an integral transform
in the continuous case. For the latter, the estimator is a simple
expression in terms of $f$ and its derivative $f'$. Since our
estimators are the unique inverse of the transform, they are
the minimum variance unbiased nonnegative estimators for
the sampling scheme, meaning that in terms of variance, they
optimally use the information in the sample. Our discrete
estimators generalize a matrix inversion applied in [?], [?]
to estimate the flow size distribution from Sampled Netflow
and SH IP flow records. Our continuous spectrum estimators
are novel even for the basic weighted SH scheme, for which
previously only estimators for sum statistics were provided [?].

We also address applications that require estimates
with statistical guarantees for multiple, possibly all, cap
statistics. One solution is to compute a set of samples with different
cap parameters which cover the range of statistics we are
interested in. A $\mathrm{cap}_T$ statistics query can then be estimated
from the sample whose $\ell$ parameter is closest to $T$. We propose
a design of a single multi-objective sample that offers both
more efficient sampling and a better tradeoff of accuracy and
sample size. The design is based on our continuous spectrum
and draws on a multi-objective design for aggregated data [?]
and the notion of sample coordination [?].

Our proposed sampling algorithms and estimators are
simple and highly practical, despite a technical analysis. The
application resembles that of classic (uncapped) SH, distinct
sampling, and approximate distinct counting algorithms that
are prevalent in industrial applications [?]. We also include
an experimental evaluation, which demonstrates superior
accuracy versus sample size tradeoffs when using a sample that is
suited for the statistic.

2 Preliminaries 

We work with key-value data sets that consist of elements
$(x, w)$, where $x$ is a key from a universe $X$ and $w > 0$.
The data set is aggregated if each key appears in at most one
element and is unaggregated otherwise. We define the weight
$w_x = \sum_{(y,w) : y = x} w$ of a key $x$ to be the sum of the weights of
elements with key $x$. If $x$ is not active (there are no elements
with key $x$), we define $w_x = 0$. When element weights are
uniform, we define $w_x$ to be the number of elements with
key $x$. The aggregated view of an unaggregated data set has
elements $(x, w_x)$ for all active keys $x$.

We are interested in sampling algorithms that process the 
unaggregated data in one or few passes while maintaining state 
that is proportional to the sample size. Such algorithms can 
be scalably executed when elements are streamed (presented 
sequentially to the algorithm) or distributed across multiple 
locations. 

We start with a quick review of relevant sampling schemes
for aggregated data sets. A Poisson sample of a key-value
dataset $\{(x, w_x)\}$ is specified by sampling probabilities $p_x$.
The sample $S$ includes each $x \in X$ with independent
probability $p_x$ and has expected size $E[|S|] = \sum_x p_x$.
To estimate a frequency statistic $Q(f,H)$ from the sample,
we can apply the inverse probability estimator
$\hat{Q}(f,H) = \sum_{x \in H \cap S} f(w_x)/p_x$ [?]. This estimator can be interpreted as
a sum of per-key estimates that are $f(w_x)/p_x$ if $x \in S$ and 0
otherwise. Note that this estimator can only be applied when
$w_x$ and $p_x$ are available for all $x \in S$. It is nonnegative and
is unbiased if $p_x > 0$ whenever $f(w_x) > 0$. It is actually the
minimum variance unbiased and nonnegative sum estimator
(sum of per-key estimates) for the given probabilities $\{p_x\}$.
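
A minimal sketch of Poisson sampling with the inverse probability estimator follows. The function names and the toy probabilities are assumptions; `exact_expectation` enumerates every possible sample of a tiny data set, so unbiasedness can be checked without relying on randomness.

```python
import itertools
import random


def poisson_sample(freqs, probs, rng):
    """Include each key x independently with probability probs[x]."""
    return {x: w for x, w in freqs.items() if rng.random() < probs[x]}


def inverse_probability_estimate(sample, probs, f, segment=lambda x: True):
    """Sum of per-key estimates f(w_x) / p_x over sampled keys in the segment."""
    return sum(f(w) / probs[x] for x, w in sample.items() if segment(x))


def exact_expectation(freqs, probs, f):
    """Expectation of the estimator, enumerating all 2^n possible samples."""
    keys = list(freqs)
    total = 0.0
    for mask in itertools.product([False, True], repeat=len(keys)):
        prob, sample = 1.0, {}
        for x, included in zip(keys, mask):
            prob *= probs[x] if included else 1.0 - probs[x]
            if included:
                sample[x] = freqs[x]
        total += prob * inverse_probability_estimate(sample, probs, f)
    return total


freqs = {"a": 3, "b": 2, "c": 1}             # aggregated view (toy data)
probs = {"a": 0.9, "b": 0.5, "c": 0.25}      # arbitrary positive probabilities
cap2 = lambda w: min(w, 2)
expectation = exact_expectation(freqs, probs, cap2)   # matches Q(cap_2, X)

one_sample = poisson_sample(freqs, probs, random.Random(7))
one_estimate = inverse_probability_estimate(one_sample, probs, cap2)
```

Any single estimate can be far from the true value; the choice of probabilities controls the variance, which is the subject of the next paragraph.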

For a dataset $\{(x, w_x)\}$, function $f$, and (expected) sample
size $k$, one can ask what the "optimal" sampling probabilities
are. It is well known that if we sample keys with probability
proportional to their contribution $f(w_x)$ (pps), we minimize
the sum of per-key variances $\sum_x f(w_x)^2 (1/p_x - 1)$. With pps,
we have the following statistical guarantee: For estimates of
the statistics $Q(f,H)$, where the segment $H$ has proportion

$$q = \frac{Q(f,H)}{Q(f,X)} = \frac{\sum_{x \in H} f(w_x)}{\sum_{x} f(w_x)}$$

of the statistics, the variance of our estimate is

$$\mathrm{Var}[\hat{Q}(f,H)] \leq \frac{1}{qk} Q(f,H)^2 .$$

Thus the CV (normalized standard error) is at most $(qk)^{-0.5}$,
which is the best bound we can hope for on average over
segments with proportion $q$. That is, any scheme that would
do better on some segments would do worse on others. Other
weighted sampling schemes we mentioned in the introduction
provide this statistical guarantee with a fixed sample size $k$:
VarOpt provides the $(qk)^{-0.5}$ quality with better estimation
for $q$ closer to 1. Sequential Poisson (Priority) sampling has
$(q(k-1))^{-0.5}$ quality [?].

One of these schemes that is particularly relevant for our
treatment of unaggregated data sets is ppswor: Keys are drawn
successively so that at each step the probability that we
draw $x$ is proportional to its weight relative to the remaining
unsampled keys: $f(w_x)/\sum_{y \notin S} f(w_y)$. The sampling can be
realized by associating with each key a random seed
seed($x$) $\sim \mathrm{Exp}[f(w_x)]$ (exponentially distributed seed with parameter
$f(w_x)$) [?]. Ordering keys by increasing seed value turns
out to correspond exactly to ppswor sampling order. A
fixed-threshold sample, for a pre-specified threshold $\tau$, includes all
keys with seed($x$) $< \tau$. Alternatively, we can obtain a fixed
size (bottom-$k$) sample by taking the $k$ keys with smallest seed
values. In the latter case, it is convenient to define $\tau$ as the
$(k+1)$st smallest seed.

Finally, we can estimate a statistic $Q(g,H)$ from the
ppswor sample taken for weights $f(w_x)$ as follows. When
we use fixed threshold sampling, we compute the probability
that $x$ is sampled

$$\Phi_\tau(w_x) = \Pr[\mathrm{seed}(x) < \tau] = 1 - e^{-f(w_x)\tau} ,$$

and apply inverse probability:

$$\hat{Q}(g,H) = \sum_{x \in H \cap S} \hat{g}(w_x \mid \tau), \quad \text{where } \hat{g}(w_x \mid \tau) = \frac{g(w_x)}{\Phi_\tau(w_x)} . \qquad (2)$$

Note that $\Phi_\tau(w_x)$ only depends on $w_x$ and $\tau$ (which are
available for sampled keys). When we work with a fixed
sample size $k$ and define $\tau$ to be the $(k+1)$st smallest seed,
we can interpret $\Phi_\tau(w_x)$ as the probability that the key $x$ is
sampled, conditioned on fixed randomization of other keys.
This means that the estimator (2) is unbiased [?]. Moreover,
the estimates $\hat{g}(w_x \mid \tau)$ obtained for different keys $x$ have
zero covariances [?], which allows us to bound the variance
on segment queries as we would do when sampling with a
pre-specified threshold. It turns out (see Theorem [?]) that
for statistics $Q(f,H)$ with proportion $q$, the CV is at most
$(q(k-1))^{-0.5}$, which is essentially (within a single sample)
our "gold standard" CV.
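
The bottom-$k$ ppswor scheme and the estimator (2) can be sketched over an aggregated data set as follows. The function names are hypothetical, and the sketch keeps a seed for every key, so it illustrates the estimator rather than a small-state implementation.

```python
import math
import random


def ppswor_bottom_k(freqs, f, k, rng):
    """Draw seed(x) ~ Exp[f(w_x)] and keep the k keys with smallest seeds.

    Returns (tau, sample), where tau is the (k+1)st smallest seed, or
    math.inf when there are at most k keys with f(w_x) > 0.
    """
    seeds = {x: rng.expovariate(f(w)) for x, w in freqs.items() if f(w) > 0}
    order = sorted(seeds, key=seeds.get)
    tau = seeds[order[k]] if len(order) > k else math.inf
    return tau, {x: freqs[x] for x in order[:k]}


def inclusion_probability(w, f, tau):
    """Phi_tau(w) = Pr[seed < tau] = 1 - exp(-f(w) * tau)."""
    return 1.0 - math.exp(-f(w) * tau)


def ppswor_estimate(sample, tau, f, g, segment=lambda x: True):
    """Estimator (2): sum of g(w_x) / Phi_tau(w_x) over sampled keys in H."""
    return sum(g(w) / inclusion_probability(w, f, tau)
               for x, w in sample.items() if segment(x))


freqs = {"a": 5, "b": 3, "c": 1, "d": 1, "e": 2}   # toy aggregated data
identity = lambda w: w
tau, sample = ppswor_bottom_k(freqs, identity, 3, random.Random(42))
sum_estimate = ppswor_estimate(sample, tau, identity, identity)
```

The sample is drawn with respect to $f$ but can be queried with a different function $g$; the variance guarantees discussed above apply when $g = f$.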

A ppswor sample with respect to any function $f(w_x)$ can
be computed from a streamed (or distributed) aggregated
data set $\{(x, w_x)\}$, using state proportional to the sample size.
This is not generally possible, however, over unaggregated
data: For example, there are polynomial lower bounds on the
state needed by a streaming algorithm which approximates
frequency moments $Q(x^p, X)$ with $p > 2$ [?].

3 Sampling framework 

We present a framework for sampling unaggregated data sets
and cast SH and distinct sampling in our framework. Our
algorithms compute a sample $S$ while maintaining state, in the
form of a cache $S$ of sampled keys, that is proportional to the
sample size. Each sampling scheme is specified through a
random mapping ElementScore($h$) of elements $h = (x, w)$
to numeric score values. The distribution of ElementScore
may only depend on the key $x$ and the weight $w$. We then define the seed
of a key $x$

$$\mathrm{seed}(x) = \min_{h \text{ with key } x} \mathrm{ElementScore}(h) \qquad (3)$$

to be the random variable that is the minimum score of all
its elements. As with ppswor, we can obtain a fixed-threshold
sample $S = \{x \mid \mathrm{seed}(x) < \tau\}$, which for a given $\tau$ includes
all keys with seed($x$) $< \tau$, or a fixed-size sample, which for
a specified sample size $k$ includes the $k$ keys with smallest
seed values and defines $\tau$ to be the $(k+1)$st smallest seed.

Once we have the sample, we can apply estimators to it to 
approximate statistics. To do so, we need to have information 
on the weight of sampled keys. The exact weights $w_x$ of
sampled keys $x \in S$ can be computed in a second pass over
the data, as we detail below. We also consider a pure streaming
(single sequential pass) setting, where we generally settle for
some $c_x \leq w_x$ and we derive estimators that are able to work
with this information. In terms of computation platform, our
2-pass schemes can be fully parallelized or distributed whereas 
our 1-pass (streaming) schemes are not as flexible: They can 
be executed on multiple streams that are processed separately 
(as with sharding) provided that all elements with the same 
key are processed at the same shard. 

3.1 2-pass scheme 

The first pass identifies the set of keys $S$ with smallest seeds.
For fixed-threshold sampling, our summary contains all keys
with scores below $\tau$. With fixed-size sampling, the summary
contains the keys with the $k$ smallest minimum scores. These
summaries are mergeable, that is, from the summaries of two
data sets we can compute a summary of their union. For
fixed $\tau$, the merged summary is the union of the keys in the
two summaries. For fixed $k$, we compute the seed of each
key in the union as the minimum seed attained in either of
the summaries. We then take the $k$ keys with smallest seeds
(retaining their seed values) as a summary of the union. Either
way the summary sizes never exceed $|S|$, which is the final
sample size. The second pass, which computes $w_x$ for $x \in S$,
uses summaries that hold the weight of each key $x \in S$ in the data
set. We merge two summaries by key-wise addition of weights
to obtain a summary of the union. Algorithm 1 is 2-pass stream
sampling with a fixed sample size $k$. Simple variations handle
distributed or parallel computation or fixed threshold sampling.
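
The merge rules described above can be sketched as follows. The summary representations (a set of keys for fixed $\tau$, a key-to-seed dictionary for fixed $k$, and a key-to-weight dictionary for the second pass) and the function names are assumptions made for illustration.

```python
import heapq


def merge_fixed_threshold(keys_a, keys_b):
    """First pass, fixed tau: the merged summary is the union of sampled keys."""
    return set(keys_a) | set(keys_b)


def merge_fixed_size(seeds_a, seeds_b, k):
    """First pass, fixed k: take the minimum seed per key, keep the k smallest."""
    merged = dict(seeds_a)
    for x, s in seeds_b.items():
        merged[x] = min(s, merged.get(x, s))
    return dict(heapq.nsmallest(k, merged.items(), key=lambda item: item[1]))


def merge_weights(weights_a, weights_b):
    """Second pass: key-wise addition of the partial weights of sampled keys."""
    merged = dict(weights_a)
    for x, w in weights_b.items():
        merged[x] = merged.get(x, 0) + w
    return merged


first_pass = merge_fixed_size({"a": 0.1, "b": 0.4}, {"a": 0.3, "c": 0.2}, k=2)
second_pass = merge_weights({"a": 2}, {"a": 3, "c": 1})
```

A practical variant would also carry the $(k+1)$st smallest seed in the fixed-$k$ summary so that $\tau$ is available after merging; that bookkeeping is omitted here for brevity.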

3.2 Fixed threshold stream sampling 

A fixed threshold scheme processes an element $h = (x, w)$ as
follows. If $x \in S$ (key $x$ is cached/sampled), then $c_x \leftarrow c_x + w$.
Otherwise, if ElementScore($h$) $< \tau$, then $x$ is inserted
into $S$ and $c_x \leftarrow w$ is initialized. The discrete scheme, which
applies to uniform weights $w = 1$, is provided as Algorithm 2
and uses the initialization $c_x \leftarrow 1$ (Counters[x] in the
pseudocode). A continuous scheme is presented in a later section.


Algorithm 1: 2-pass stream sampling: fixed size k
Data: sample size k, elements (x, w) where x ∈ X and w > 0
Output: τ; set of k pairs (x, w_x) where x ∈ X
Counters ← ∅            // initialize sample
τ ← +∞                  // upper bound on ElementScore
// Pass I: identify the k sampled keys
foreach stream element h = (x, w) do
    if x is in Counters then
        seed(x) ← min{seed(x), ElementScore(h)}
    else
        s ← ElementScore(h)
        if s < τ then
            seed(x) ← s; Counters[x] ← 0
            if |Counters| = k + 1 then
                y ← argmax{seed(x) | x in Counters}
                τ ← seed(y)
                delete seed(y), Counters[y]
// Pass II: compute w_x for sampled keys
foreach stream element h = (x, w) do
    if x is in Counters then
        Counters[x] ← Counters[x] + w
return (τ; (x, Counters[x]) for x in Counters)
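
A direct Python rendering of Algorithm 1 might look like the sketch below. It assumes the elements are available as a re-iterable sequence (standing in for two passes over the data) and that `element_score` is a caller-supplied function of the key and weight; both are illustrative assumptions.

```python
import math


def two_pass_sample(elements, k, element_score):
    """Algorithm 1 sketch: returns (tau, {x: w_x}) for the k smallest seeds."""
    seeds = {}
    tau = math.inf
    # Pass I: identify the k keys with the smallest seeds.
    for x, w in elements:
        score = element_score(x, w)
        if x in seeds:
            seeds[x] = min(seeds[x], score)
        elif score < tau:
            seeds[x] = score
            if len(seeds) == k + 1:
                y = max(seeds, key=seeds.get)   # key with the largest seed
                tau = seeds[y]
                del seeds[y]
    # Pass II: compute the exact weight of every sampled key.
    counters = dict.fromkeys(seeds, 0)
    for x, w in elements:
        if x in counters:
            counters[x] += w
    return tau, counters


# Deterministic per-key scores (as in distinct sampling) make the run reproducible.
fixed_scores = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.9}
elements = [("a", 1), ("b", 2), ("c", 1), ("a", 2), ("d", 1), ("b", 1)]
tau, sample = two_pass_sample(elements, 2, lambda x, w: fixed_scores[x])
```

Finding the maximum by a linear scan costs $O(k)$ per eviction; a priority queue would be the natural refinement.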


Algorithm 2: Stream sampling with fixed threshold τ
Data: threshold τ, stream of elements with key x ∈ X
Output: set of pairs (x, c_x) where x ∈ X and c_x ∈ [1, w_x]
Counters ← ∅            // initialize Counters cache
foreach stream element h with key x do    // process a stream element
    if x is in Counters then
        Counters[x] ← Counters[x] + 1
    else
        if ElementScore(h) < τ then
            Counters[x] ← 1               // initialize c_x
return ((x, Counters[x]) for x in Counters)
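
Algorithm 2 translates almost line for line into Python. In this sketch the stream is an iterable of keys (uniform weights) and `element_score` is a caller-supplied function; the scripted scores in the usage example are an assumption that makes the run deterministic.

```python
def fixed_threshold_sample(keys, tau, element_score):
    """Algorithm 2 sketch: one pass, returns {x: c_x} with 1 <= c_x <= w_x."""
    counters = {}
    for x in keys:
        if x in counters:
            counters[x] += 1            # key already cached: count the element
        elif element_score(x) < tau:
            counters[x] = 1             # key enters the sample on this element
    return counters


# Scripted scores, consumed only by elements whose key is not cached yet.
scripted = iter([0.8, 0.2, 0.4, 0.9])
stream = ["a", "b", "a", "a", "b", "c"]
counts = fixed_threshold_sample(stream, 0.5, lambda x: next(scripted))
```

In this run key "a" has weight 3 but its first element is not sampled, so its count is 2; this gap between $c_x$ and $w_x$ is what the streaming estimators must correct for.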


3.3 Fixed size stream sampling 

Algorithm 3 provides pseudocode for discrete (uniform
weights) stream sampling with a fixed sample size $k$.

The algorithm maintains a set $S$ (Counters) of cached keys.
For each cached key $x$, it keeps a count $c_x$ (Counters[x]) and a
lazily computed seed value seed($x$). When processing an
element $h$ with key $x$, we first check whether $x \in S$.
If $x \in S$, we increment $c_x$.

Otherwise, if $x \notin S$, we compute $y \leftarrow$ ElementScore($h$), and if $y < \tau$, we insert $x$ into $S$ with
$c_x \leftarrow 1$ and seed($x$) $\leftarrow y$. As a result, we may have $|S| = k + 1$
cached keys. In this case, we would like to evict from $S$ the
key with maximum seed. But the seeds are not fully evaluated
yet, in that the current seed($x$) only reflects the seed up to
the first element that is currently counted in $c_x$.

We repeat the following until a key is evicted. We pop
from $S$ the key $y$ with maximum current seed and set
$\tau \leftarrow$ seed($y$). We then iterate, decreasing the count $c_y$ and
scoring "uncounted" elements, until either the count becomes
$c_y = 0$ and $y$ is evicted or we obtain a score that is below $\tau$.
In the latter case, we reinsert $y$ into $S$ with seed($y$) equal to
that score.


Algorithm 3: Stream sampling with fixed size k
Data: sample size k, stream of elements with key x ∈ X
Output: τ; set of k pairs (x, c_x) where x ∈ X and c_x ∈ [1, w_x]
Counters ← ∅            // initialization
τ ← 1                   // supremum of ElementScore range
foreach element h with key x do
    if x is in Counters then
        Counters[x] ← Counters[x] + 1
    else
        score ← ElementScore(h)
        if score < τ then
            seed(x) ← score
            Counters[x] ← 1
            while |Counters| > k do
                y ← argmax{seed(x) | x in Counters}
                τ ← seed(y)
                while Counters[y] > 0 and seed(y) ≥ τ do
                    Counters[y] ← Counters[y] − 1
                    seed(y) ← ElementScore(y)    // score of the next element of key y
                if Counters[y] == 0 then
                    delete Counters[y], seed(y)
return (τ; (x, Counters[x]) for x in Counters)
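
The following sketch follows Algorithm 3 step by step, including the lazy re-scoring during eviction. It is written under the assumption that `element_score` takes a key and returns a score in $[0, 1)$, and it finds the maximum seed by a linear scan instead of a priority queue to keep the code short.

```python
def fixed_size_sample(keys, k, element_score):
    """Algorithm 3 sketch: one pass, returns (tau, {x: c_x}) with at most k keys."""
    counters, seeds = {}, {}
    tau = 1.0                                   # supremum of the score range
    for x in keys:
        if x in counters:
            counters[x] += 1
            continue
        score = element_score(x)
        if score < tau:
            seeds[x] = score
            counters[x] = 1
            while len(counters) > k:
                y = max(seeds, key=seeds.get)   # key with the largest current seed
                tau = seeds[y]
                # Drop counted elements of y until one scores below tau
                # or the count reaches zero.
                while counters[y] > 0 and seeds[y] >= tau:
                    counters[y] -= 1
                    seeds[y] = element_score(y)
                if counters[y] == 0:
                    del counters[y], seeds[y]
    return tau, counters


# With one fixed score per key (distinct sampling) an evicted key never returns,
# and the counts of the surviving keys are their exact weights.
fixed_scores = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.9}
stream = ["a", "b", "a", "c", "a", "d", "b"]
tau, counts = fixed_size_sample(stream, 2, lambda x: fixed_scores[x])
```

With independent per-element scores (as in SH), a popped key can survive with a smaller count and a fresh, lower seed, in which case the outer loop goes on to evict a different key.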


Analysis: Clearly, the work of Algorithm 2 (fixed-threshold
sampling) is $O(1)$ per stream element. We show
the following for fixed-size sampling:

Lemma 3.1. The amortized per-element work of Algorithm 3
is $O(1)$.

Proof: The algorithm maintains at most $k$ cached keys
in a max priority queue, accessible by decreasing seed($x$).
When there are fewer than $k$ active keys, all of them are
cached, and otherwise $k$ keys are cached. The costlier
operations are eviction steps, which happen when a new key is
inserted and the cache is full. The expected total number of
evictions is at most $k \ln m'$, where $m'$ is the expected number
of distinct element scores. The value of $m'$ depends on our
element scoring function but is always at most the number of
elements and at least the number of distinct keys (since scores
of different keys are independent). In any case, the number of
evictions is logarithmic in the stream size.

In an eviction step, a key $x$ is popped and $c_x$ is
decremented at least once. If the final count is not zero, then
$x$ is placed back on the queue with a strictly lower seed($x$).
The median value of the new seed($x$) in this case is the
median of the score distribution, provided it is lower than $\tau$.

The total work on decreasing the counts can be "charged" to
the processing of the corresponding element, but there is also
a possible charge of a priority queue insertion for keys whose
count got decreased and did not get removed. A priority queue
operation costs about $O(\log k)$. We can bound the number
of such operations by noting that in expectation, seed($x$)
decreases so that the probability of a new
element score being below it is halved. This means that
the expected number of times a key can be placed back is at
most logarithmic in the number of distinct scores its elements
can have. It also means that $k$ "place backs" correspond in
expectation to a decrease of the threshold to the conditional
median, which can happen when $m'$ doubles. So in expectation
there are $O(1)$ place backs per eviction step. □

3.4 Element scoring properties 

We will select the element scoring function according to the
statistics $f$ we are interested in. Intuitively, to obtain quality
estimates (CV upper bound of $(qk)^{-0.5}$), we would need to
sample each key $x$ with probability roughly proportional to
$f(w_x)$. The challenge is to identify when and how we can
achieve this with a small-state streaming algorithm.

Some properties of our element scoring functions that
greatly simplify the derivation of estimators are that seed
values of different keys are independent and that for a
particular key $x$, the distribution of seed($x$) (the minimum element
score) depends only on $w_x$, and not on the arrangement of
elements or on the breakdown of the weight of each key
to different elements. Furthermore, we would also want the
distribution of $c_x$ for $x \in S$ to only depend on $w_x$ and $\tau$. We
assume here that we work with perfectly random numbers and
hash functions.

3.5 Estimation 

As with the ppswor estimator reviewed in Section 2, we use
estimators that can be expressed as a sum over keys $x \in H$
of individual estimates $\hat{f}(w_x)$ of $f(w_x)$. The estimates are
unbiased and are 0 for keys $x \notin S$.

With two-pass sampling (Algorithm 1), we have the weight
$w_x$, and therefore $f(w_x)$, for each sampled key $x \in S$. When
the seed distribution only depends on $w_x$ and $\tau$, we can
compute the inclusion probability $\Phi_\tau(w_x)$ of a key $x$ from its
weight $w_x$ and apply inverse probability estimation as in (2).
In the streaming (single pass) schemes, the sample includes
a partial count $c_x \leq w_x$ for each $x \in S$. The requirement
that the distribution of $c_x$ only depends on $w_x$ and $\tau$ allows
us to express sampling as a transform (which depends on $\tau$)
from the distribution of $w_x$ to the expected outcome distribution of
$c_x$. The derivation of unbiased estimators then corresponds to
inverting this transform.

The transforms we obtain have a unique inverse, which
means that our estimators are the optimal (minimum variance)
unbiased and nonnegative sum estimators. Because the 2-pass
estimators (2) are also optimal, and rely on more information
(the exact value $w_x$ instead of a sample from a distribution
with parameter $w_x$), the variance of the streaming estimators
is always at least that of the 2-pass estimator.

The estimators for both the fixed-threshold and the fixed
sample-size schemes are stated in terms of the threshold
$\tau$. When working with a fixed sample size $k$, $\tau$ is
defined as the $(k+1)$st smallest seed. As with ppswor, $\tau$ when
defined this way plays the same role as the threshold value $\tau$
used in a fixed sampling threshold scheme: The probability
that a key is sampled, conditioned on fixed randomization of
other keys, is the probability that its seed value is below the $k$th
smallest seed of other keys. When the key is included in the
sample, this value is $\tau$. Similarly, under the same conditioning,
the distribution of $c_x$ only depends on $\tau$ and $w_x$, and is the
same as in the respective fixed threshold scheme with $\tau$.
Moreover, the covariance of the estimates obtained for two
different keys $x, y$ is zero. The argument is the same as with
ppswor [?] and SH [?]. This important property allows us to
bound the variance on estimates of segment statistics by the
sum of variances of estimates for individual keys.

We now cast two existing basic sampling schemes in our
framework: Distinct, which is designed for $\mathrm{cap}_1$ statistics, and
SH, which is designed for $\mathrm{cap}_\infty$ (sum) statistics.

3.6 Distinct sampling

A distinct sample is a uniform sample of active keys (those
with $w_x > 0$), meaning that conditioned on sample size $k$,
all subsets of active keys of that size are equally likely. For an element $h$
with key $x$, we use ElementScore($h$) = Hash($x$), where
Hash($x$) $\sim U[0,1]$ is a random hash function selected before
we process the stream. Note that all elements of the same key
$x$ have the same score and therefore seed($x$) = Hash($x$).

When we sample with respect to a fixed threshold $\tau$, we
retain all keys with Hash($x$) $< \tau$. When using a fixed
sample size $k$, the scheme is the following (distinct variant)
of reservoir sampling [?]: For each stream element, compute
Hash($x$) and retain the $k$ keys with smallest hash values.

With distinct sampling, the value $c_x$ is equal to the exact
weight $w_x$ for each sampled key $x$. This is because any key
that enters our cache does so on the first element of the key. If
a key is evicted (in the fixed-$k$ scheme), it can never re-enter.
We also have that for all keys with $w_x > 0$, the probability
that $x$ is sampled is $\tau$. We can therefore apply
the inverse probability estimator (2):

$$\hat{Q}(f,H) = \tau^{-1} \sum_{x \in S \cap H} f(w_x) . \qquad (4)$$

Distinct sampling is optimized for distinct ($\mathrm{cap}_1$) statistics.
In particular, $\hat{Q}(\mathrm{cap}_1, X)$ has CV upper bounded by
$(k-1)^{-0.5}$ [?], [?], and for a segment $H$ with proportion $q$ of
distinct keys, $\hat{Q}(\mathrm{cap}_1, H)$ has CV upper bounded by
$(q(k-1))^{-0.5}$, as it is equivalent to the ppswor estimator for $f(w) = 1$.
For general $\mathrm{cap}_T$ statistics, however, the CV grows rapidly
with $T$ (we shall see it is $\propto \sqrt{T}$). This is because our uniform
sample of active keys can easily miss keys with high $f(w_x)$
values, which contribute more to the statistics.
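
A fixed-threshold distinct sample and the estimator (4) can be sketched as follows. The hash construction (SHA-256 truncated to 53 bits, with an optional `salt`) is an assumption standing in for the "perfectly random" hash function of the analysis.

```python
import hashlib


def hash01(key, salt=""):
    """Deterministic pseudo-random value in [0, 1) derived from the key."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2 ** 53


def distinct_sample(keys, tau, salt=""):
    """Keep every key with Hash(x) < tau; counts are exact weights w_x."""
    counts = {}
    for x in keys:
        if hash01(x, salt) < tau:
            counts[x] = counts.get(x, 0) + 1
    return counts


def distinct_estimate(counts, tau, f, segment=lambda x: True):
    """Estimator (4): tau^{-1} times the sum of f(w_x) over sampled keys in H."""
    return sum(f(w) for x, w in counts.items() if segment(x)) / tau


stream = ["a", "b", "a", "c", "a", "b"]
sample = distinct_sample(stream, tau=0.5)
estimate = distinct_estimate(sample, 0.5, lambda w: min(w, 1))
```

Because every element of a key receives the same score, a key is either sampled from its first element onward or never sampled, which is why the counts are exact.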


3.7 Sample and Hold (SH) 

Classic SH, with fixed sampling threshold r or with fixed sam¬ 
ple size k, Eoi, na is specified for uniform element weights, 
so that Wx is the number of elements with key x. We cast SH 
in our framework using ElementScore (h) ^ (7[0,1]. Note 
that each key x can have many independent scores drawn, one 
for each element of x. Therefore, the more elements a key has, 
the more likely it is to be sampled. The seed is the minimum 
element score, which can be transformed to an exponentially 
distributed random variable with parameter Wx- Therefore, as 
observed in 0, the SH sample is actually a ppswor sample 
with respect to the weights Wx ES- When we use a second 
pass (Section HD to obtain the exact weights Wx, we can 
apply the ppswor estimator 0. 

With stream sampling (Algorithm and Algorithm |^, the 
final count of a key x has $c_x \le w_x$, where $w_x - c_x + 1$ is
geometric with parameter $\tau$, truncated at $w_x + 1$ (the probability
of $c_x = 0$ is $(1-\tau)^{w_x}$). An unbiased estimator for statistics
Q(f,H) from an SH sample is [9]:

$$\hat{Q}(f,H) = \tau^{-1} \sum_{x \in S \cap H} \left[ f(c_x) - f(c_x - 1)(1-\tau) \right] . \tag{5}$$

Note that this estimator is nonnegative when f is monotone
non-decreasing. This is because for all $i > 0$,
$f(i) - f(i-1)(1-\tau) \ge 0$. Surprisingly perhaps, we show here (Theorem
C.1) that the 1-pass estimate is not too far from the 2-pass
estimate, in that for sum statistics (f(x) = x) the CV is also
upper bounded by $(q(k-1))^{-0.5}$.
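To make the 1-pass estimator concrete, here is a small sketch of fixed-threshold SH with uniform element weights together with the coefficients of estimator (5). The function names and the `rng` argument are illustrative assumptions. `expected_contribution` evaluates the expectation exactly from the geometric law of the first counted element, which gives a direct numerical check of unbiasedness for any f with f(0) = 0.

```python
import random

def sample_and_hold(stream, tau, rng):
    """Classic fixed-threshold SH with uniform element weights."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1          # key is cached: count every further element
        elif rng.random() < tau:
            counts[x] = 1           # element score below tau: start counting
    return counts

def sh_coefficient(c, tau, f):
    """Contribution of a key with count c >= 1 in estimator (5)."""
    return (f(c) - f(c - 1) * (1 - tau)) / tau

def sh_estimate(counts, tau, f, in_segment=lambda x: True):
    """Sum of the coefficients over sampled keys in the segment."""
    return sum(sh_coefficient(c, tau, f)
               for x, c in counts.items() if in_segment(x))

def expected_contribution(w, tau, f):
    """Exact expectation: the count is w-i+1 with probability tau(1-tau)^(i-1)."""
    return sum(tau * (1 - tau) ** (i - 1) * sh_coefficient(w - i + 1, tau, f)
               for i in range(1, w + 1))
```

The sum in `expected_contribution` telescopes to f(w), which is the unbiasedness property stated above.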

For cap statistics with small T, however, the SH estimates
can have CV that far exceeds our $(qk)^{-0.5}$ target: When the
frequency distribution is highly skewed, the ppswor sample
would be dominated by heavy keys. This means that segments
with a large proportion of the $\mathrm{cap}_T$ statistics that mostly
include keys with low frequencies would have a disproportionally
small representation in the sample and thus large errors.

4 The DISCRETE SH SPECTRUM 

Our discrete SH spectrum is parametrized by an integer $\ell \ge 1$.
Distinct sampling is $\mathrm{SH}_1$ and classic SH is $\mathrm{SH}_\infty$. In general,
$\mathrm{SH}_\ell$ is designed to estimate well frequency cap statistics with
$T \approx \ell$.

The $\mathrm{SH}_\ell$ element scoring function for an element h with
key x draws a uniform random bucket $b \sim U[1, \ldots, \ell]$ and
returns a hash of the pair, Hash(x, b) ~ U[0,1]. Note that the
buckets are independent for different elements with key x.

$$\text{ElementScore}(h) \leftarrow \text{Hash}(\lfloor \ell \cdot \text{rand}() \rfloor, x) . \tag{6}$$
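The scoring rule (6) and the resulting fixed-threshold stream sampler can be sketched as below. This is an illustration under stated assumptions: `pair_hash` is a hypothetical SHA-256 based stand-in for Hash, buckets are numbered 0..ℓ-1 (as produced by the floor in (6)), and the sampler starts counting a key at the first element whose score falls below the threshold.

```python
import hashlib
import math
import random

def pair_hash(bucket, x, salt=""):
    """Pseudo-uniform hash of the pair (bucket, key) in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}|{bucket}|{x}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def element_score(x, ell, rng, salt=""):
    """Scoring (6): hash of (uniformly random bucket, key)."""
    bucket = math.floor(ell * rng.random())   # uniform over 0..ell-1
    return pair_hash(bucket, x, salt)

def sh_ell_sample(stream, ell, tau, rng, salt=""):
    """Fixed-threshold discrete SH_ell: start counting once a score is below tau."""
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1
        elif element_score(x, ell, rng, salt) < tau:
            counts[x] = 1
    return counts
```

With `ell = 1` every element of a key receives the same score, so the sampler reduces to distinct sampling; with many buckets almost every element gets a fresh score, as in classic SH.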

Recall that seed(x) is the minimum score of an element
with key x. When $\ell = 1$, the seed distribution is uniform
for all keys with $w_x > 0$. More generally, we can see
that the element scoring (6) provides up to $\ell$ "independent"
attempts for each key to obtain a lower seed. That way, keys
1. Estimators for a related scheme (where elements are drawn with replace¬ 
ment) were presented in Q). 

2. With fixed-size sampling, we can instead use here the stratified value 

T = k/(k -h XlxGA’ “ ^xGSnx 
Fig. 1. $\mathrm{SH}_\ell$ sampling probability per key weight w, for
selected values of $\ell$ ($\tau = 0.01$). Note that for $w \gg \ell \log \ell$
the probability is constant and for $w \ll \ell$ the probability is
proportional to w. We can see that the probability is close
to being proportional to $\min\{w, \ell\}$, which is what we want
for estimating $\mathrm{cap}_\ell$ statistics.


with more elements are more likely to have a lower seed
and be sampled, but with diminishing return: Keys where
$w_x \ll \min\{\ell, \tau^{-1}\}$ are sampled with probability roughly
proportional to $w_x$, whereas keys with $w_x \gg \min\{\ell, \tau^{-1}\}$
have a roughly constant inclusion probability regardless of
frequency. Also note that when the cap parameter is large
relative to the inverse sampling threshold, $\ell \gg \tau^{-1}$, $\mathrm{SH}_\ell$
is similar to $\mathrm{SH}_\infty$. Figure 1 illustrates these properties by
showing the sampling probability of a key as a function of
$w_x$, for selected values of the parameter $\ell$.

4.1 Estimators for discrete SH^ 

The output of our stream sampling algorithm is a threshold
value $\tau$ and a set S of pairs of the form $(y, c_y)$, where $y \in \mathcal{X}$
and $c_y \in [1, w_y]$.

Coefficient form. We express our estimators as vectors
$\beta^{(f,\tau,\ell)}$, which depend on f, the threshold $\tau$, and the
parameter $\ell$. The c-th entry $\beta_c^{(f,\tau,\ell)}$ is the contribution to the
estimate of a key with count c. The estimate of the statistics Q(f,H)
is then

$$\hat{Q}(f,H) = \sum_{x \in S \cap H} \beta_{c_x}^{(f,\tau,\ell)} .$$

The distinct sample ($\ell = 1$) estimator (4) is expressed using
$\beta_i = f_i \tau^{-1}$ (using the notation $f_i \equiv f(i)$), whereas the SH
estimator ($\ell = +\infty$) (5) is expressed using
$\beta_i = \tau^{-1}\left(f_i - f_{i-1}(1-\tau)\right)$.
We seek estimators of this form for general $\ell$
that are unbiased, admissible, and nonnegative ($\beta \ge 0$) when f
is non-decreasing.

Probability vector $\phi$. Let $\phi_i$ be the probability that the i-th
element of the same key was the first one to get counted by
$\mathrm{SH}_\ell$. The vector $\phi$ depends on the parameters $\ell$ and $\tau$.

For $\ell = 1$, we have the closed form $\phi_1 = \tau$ and $\phi_i = 0$ for
$i > 1$. For $\ell = +\infty$, we have $\phi_i = (1-\tau)^{i-1}\tau$.

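To express $\phi$ for general $\ell$, we let $a_{i,j}$ be the probability that
we used exactly $j \le \min\{\ell, i\}$ buckets in the first i elements
of a key.

By definition $a_{0,i} = 0$ when $i \ge 1$, $a_{i,j} = 0$ when
$j > \min\{\ell, i\}$, and $a_{i,0} = 0$ when $i \ge 1$. Otherwise, $a_{1,1} = 1$ and for $i > 1$,
$j \le \min\{\ell, i\}$, the values can be computed from the relation

$$a_{i,j} = a_{i-1,j}\,\frac{j}{\ell} + a_{i-1,j-1}\,\frac{\ell - j + 1}{\ell} \tag{8}$$

(the i-th element either falls in one of the j buckets already used,
or in one of the $\ell - j + 1$ buckets not used so far).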


Note that as i grows, the vectors $a_{i\cdot}$ converge to a vector that
has all entries 0 except $a_{i\ell} = 1$. It therefore suffices to compute
these entries only until $i = O(\ell \log \ell)$. For larger values of i
we can use the vector $a_{i\cdot} = (0, \ldots, 0, 1)$.

We can now write 


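$$\phi_i = \tau \sum_{j=0}^{\min\{i-1,\ell-1\}} a_{i-1,j}\,\frac{\ell - j}{\ell}\,(1-\tau)^{j} ,$$

taking $a_{0,0} = 1$ so that $\phi_1 = \tau$: the first $i-1$ elements use j
distinct buckets, all with hash at least $\tau$, and the i-th element
falls in a new bucket whose hash is below $\tau$.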

Note that it always suffices to compute only the first M entries
of $\phi$, where

$$M = O(\min\{\ell \log \ell, \tau^{-1} \log \tau^{-1}\}) .$$
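The bucket-occupancy recursion (8) and the vector $\phi$ can be computed as in the sketch below. It assumes the convention $a_{0,0} = 1$ and the reading of $\phi_i$ as "the first i-1 elements use j buckets, all with hash at least $\tau$, and the i-th element opens a new bucket with hash below $\tau$"; the function names are illustrative only.

```python
def phi_vector(ell, tau, n):
    """Return (phi, a): phi[i] for i = 1..n (phi[0] unused), and a[j], the
    probability that exactly j buckets are used after n elements."""
    a_prev = [0.0] * (ell + 1)
    a_prev[0] = 1.0                      # convention a_{0,0} = 1 (assumption)
    phi = [0.0] * (n + 1)
    for i in range(1, n + 1):
        # i-th element is the first counted one: new bucket, hash below tau
        phi[i] = tau * sum(a_prev[j] * (ell - j) / ell * (1 - tau) ** j
                           for j in range(0, min(i - 1, ell - 1) + 1))
        a = [0.0] * (ell + 1)
        for j in range(1, min(i, ell) + 1):   # recursion (8)
            a[j] = a_prev[j] * j / ell + a_prev[j - 1] * (ell - j + 1) / ell
        a_prev = a
    return phi, a_prev

def inclusion_probability(ell, tau, w):
    """Probability that a key with w elements is sampled: sum of phi_1..phi_w."""
    phi, _ = phi_vector(ell, tau, w)
    return sum(phi[1:])
```

A useful consistency check is that the prefix sum of $\phi$ equals one minus the probability that all used buckets hash above $\tau$, that is, $1 - \sum_j a_{w,j}(1-\tau)^j$.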


A 2-pass estimator. The probability that a key x is sampled
(illustrated in Figure 1) is

$$\Phi_{\tau,\ell}(w_x) = \sum_{j=1}^{w_x} \phi_j .$$

If we use a 2-pass scheme (Section [3T]l, we can apply the 
inverse probability estimator 

$$\hat{Q}(f,H) = \sum_{x \in S \cap H} \frac{f(w_x)}{\Phi_{\tau,\ell}(w_x)} .$$

Inverting the sample counts. We now derive a streaming
estimator. We use the notation $o_i = |\{x \in S \cap H \mid c_x = i\}|$ (the
"observed" count) for the random variable that is the number
of keys $x \in S \cap H$ with $c_x = i$. Let $m_i = |\{x \in H \mid w_x = i\}|$
be the number of keys in H with count $w_x = i$. Our statistics
can be expressed as $Q(f,H) = f^T m$. We have the relation
$E[o_i] = \sum_{j \ge i} \phi_{j-i+1} m_j$ and can write

$$E[o] = Y^{(\phi)} m .$$

We use the notation $Y^{(v)}$ for an upper triangular matrix that
corresponds to a vector v, such that $\forall j \ge i$, $[Y^{(v)}]_{ij} = v_{j-i+1}$.

We have $m = (Y^{(\phi)})^{-1} E[o]$. Therefore, from linearity,
$\hat{m} = (Y^{(\phi)})^{-1} o$ is an unbiased estimator of m. Therefore, to
compute the estimate we need to invert $Y^{(\phi)}$.

The inverse of the matrix $Y^{(\phi)}$ has the same upper triangular
structure, and can be expressed as $Y^{(\psi)}$ with respect to
another vector $\psi$. To compute $\psi$, we consider the constraints
$Y^{(\psi)} Y^{(\phi)} = I$ obtained from the product of the first row
of $Y^{(\psi)}$ with the columns of $Y^{(\phi)}$. We obtain the equations
$\psi_1 = \phi_1^{-1}$, and for $i > 1$,

$$\sum_{j=1}^{i} \psi_j \phi_{i+1-j} = 0 .$$

This allows us to iteratively solve for $\psi_i$ after computing $\psi_j$
for $j < i$ using

$$\psi_i = -\frac{1}{\phi_1} \sum_{j=1}^{i-1} \psi_j \phi_{i+1-j} .$$










For distinct sampling we have $\psi_1 = \tau^{-1}$ and $\psi_i = 0$ for
$i > 1$. For $\mathrm{SH}_\infty$ we have $\psi_1 = \tau^{-1}$, $\psi_2 = -(1-\tau)\tau^{-1}$,
and $\psi_i = 0$ for $i > 2$. In general, however, $\psi$ can have many
non-zero entries.

We show the following : 

Theorem 4.1. The estimator $\hat{Q}(f,H) = \sum_{x \in S \cap H} \beta_{c_x}$, where

$$\beta_i = \sum_{j=1}^{i} \psi_j f_{i-j+1} ,$$

is unbiased.

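Proof: By substituting $\hat{m} = (Y^{(\phi)})^{-1} o$ in $Q(f) = f^T m$, we
obtain the estimator $\hat{Q}(f) = f^T Y^{(\psi)} o$.

The unbiased estimate for $m_i$ is

$$\hat{m}_i = \sum_{j \ge i} o_j \psi_{j-i+1} .$$

The unbiased estimate for the contribution of keys with i
elements to the statistics is

$$f_i \hat{m}_i = f_i \sum_{j \ge i} o_j \psi_{j-i+1} .$$

Therefore, the total contribution, expressed in terms of
$o_i$, is

$$\sum_{i} \sum_{j \ge i} f_i\, o_j\, \psi_{j-i+1} = \sum_{i} o_i \sum_{j=1}^{i} \psi_j f_{i-j+1} .$$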

Since the inverse is unique, our estimator is the only
unbiased estimator of this form and thus also admissible
(minimum variance of this form). Note that when we only
compute the first M entries of $\psi$, we limit the sum expression
to range from 1 to $\min\{M, i\}$. In applications, the coefficients
$\beta_i$ only need to be computed for i such that there is at least
one key x in the sketch with $c_x = i$.
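The inversion step and the coefficients of Theorem 4.1 are short to compute. The sketch below uses 1-indexed lists (index 0 unused); the function names are illustrative assumptions. The test values use the closed form of $\phi$ for classic SH, for which $\psi$ has only two non-zero entries.

```python
def psi_from_phi(phi, n):
    """Solve the first-row constraints of Y(psi) Y(phi) = I for psi_1..psi_n."""
    psi = [0.0] * (n + 1)
    psi[1] = 1.0 / phi[1]
    for i in range(2, n + 1):
        # sum_{j=1..i} psi_j * phi_{i+1-j} = 0, solved for psi_i
        psi[i] = -sum(psi[j] * phi[i + 1 - j] for j in range(1, i)) / phi[1]
    return psi

def beta(psi, f, i):
    """Estimation coefficient beta_i = sum_{j=1..i} psi_j * f(i-j+1)."""
    return sum(psi[j] * f(i - j + 1) for j in range(1, i + 1))
```

For a key with w elements, the expected contribution $\sum_{i=1}^{w} \phi_i \beta_{w-i+1}$ then equals f(w), which can be checked numerically for any $\phi$ with $\phi_1 > 0$.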

We show that the estimates are nonnegative when / is 
monotone non-decreasing: 

Theorem 4.2. When f is monotone non-decreasing, then for
all $\ell$ and $\tau$, $\beta^{(f,\tau,\ell)} \ge 0$.

Proof: We first claim that any prefix sum of $\psi$ is nonnegative.
That is,

$$\forall j \ge 1, \quad \sum_{i=1}^{j} \psi_i \ge 0 . \tag{9}$$

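We prove the claim by induction on j. The base case of the
induction has $\psi_1 = \phi_1^{-1} > 0$. We now show that
$\sum_{i=1}^{h} \psi_i \ge 0$ if the claim (9) holds for all $j < h$. We have

$$0 = \sum_{i=1}^{h} \psi_i \phi_{h-i+1}
   = \phi_1 \sum_{i=1}^{h} \psi_i + \sum_{j=1}^{h-1} \Big( \sum_{i=1}^{j} \psi_i \Big) (\phi_{h-j+1} - \phi_{h-j}) .$$

Rearranging, we obtain

$$\phi_1 \sum_{i=1}^{h} \psi_i = \sum_{j=1}^{h-1} \Big( \sum_{i=1}^{j} \psi_i \Big) (\phi_{h-j} - \phi_{h-j+1}) .$$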

We now argue that the right hand side is nonnegative. In fact,
each summand, and each term in the product, is nonnegative.
Nonnegativity of the sums $\sum_{i=1}^{j} \psi_i$ follows from the induction
hypothesis for $j < h$. Nonnegativity of the differences
$\phi_{h-j} - \phi_{h-j+1}$ for $j < h$ follows from $\phi_i \ge 0$ being non-increasing
(recall that $\phi_i$ is the probability that the i-th element of a key
is the first one to be counted). Now, the left hand side is
nonnegative and $\phi_1 = \tau > 0$. Therefore, $\sum_{i=1}^{h} \psi_i \ge 0$.

We now use the claim on the prefix sums of $\psi$ to show that
the estimation coefficients are nonnegative (with $f_0 = f(0) = 0$):

$$\beta_i = \sum_{j=1}^{i} \psi_j f_{i-j+1}
        = \sum_{h=1}^{i} (f_h - f_{h-1}) \sum_{j=1}^{i-h+1} \psi_j .$$

We now observe that the right hand side is nonnegative. This
follows from monotonicity of f and our claim (9) on
the nonnegativity of the $\psi$ prefix sums.

□ 

5 The CONTINUOUS SH SPECTRUM 

We now present our continuous $\mathrm{SH}_\ell$ sampling schemes, which
generalize SH with weighted updates ($\ell = \infty$). The
continuous design offers the following advantages over the
discrete design even when applied to uniform weights. First,
fixed sample-size sampling no longer requires explicitly maintaining
a lazy seed(x) for cached keys, as we did in the discrete algorithm.
The lazy value is implicitly captured by the current threshold
$\tau$. Second, the estimator can be expressed in terms of f and
its derivative. Lastly, the continuous spectrum facilitates
multi-objective samples (Section 6).

Our input is a stream of elements h = (x, w) with key
x and a weight w > 0. Our element scoring is as follows:
Each key has a base hash KeyBase(x) ~ $U[0, 1/\ell]$ that
is fixed for the computation and is uniformly distributed in
$[0, 1/\ell]$: KeyBase(x) ← Hash(x)/$\ell$. An element h = (x, w)
is assigned a score by first drawing v ~ Exp[w] and then
returning v if $v > 1/\ell$ and KeyBase(x) otherwise:

$$\text{ElementScore}(h) = (v \sim \text{Exp}[w]) \le 1/\ell \;?\; \text{KeyBase}(x) : v . \tag{10}$$

The random variables Exp[w] are independent for different
elements and are also independent of KeyBase(x).
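The scoring rule (10) translates almost directly into code. In the sketch below, `unit_hash` is a hypothetical SHA-256 based stand-in for Hash(x), and Exp[w] is drawn with the standard library call `random.Random.expovariate`, whose argument is the rate parameter.

```python
import hashlib
import random

def unit_hash(x, salt=""):
    """Pseudo-uniform hash of a key in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53

def key_base(x, ell, salt=""):
    """KeyBase(x) = Hash(x) / ell, fixed per key and uniform in [0, 1/ell)."""
    return unit_hash(x, salt) / ell

def element_score(x, w, ell, rng, salt=""):
    """Scoring (10): KeyBase(x) if the exponential draw is at most 1/ell, else the draw."""
    v = rng.expovariate(w)               # v ~ Exp[w]
    return key_base(x, ell, salt) if v <= 1.0 / ell else v
```

Heavy elements almost always fall back to the fixed KeyBase value, which is what caps the advantage of high-frequency keys.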

We now consider the distribution of seed(x) (the minimum
element score of stream elements with key x). We show
that seed(x) ~ $U[0, 1/\ell]$ with probability $(1 - e^{-w_x/\ell})$ and
seed(x) ~ $1/\ell + \text{Exp}[w_x]$ otherwise:

Lemma 5.1.

$$\text{seed}(x) \sim (v \sim \text{Exp}[w_x]) \le 1/\ell \;?\; U[0, 1/\ell] : v .$$

Proof: If at least one of the random variables Exp[w(h)]
for elements h with key x is smaller than $1/\ell$, then seed(x) = KeyBase(x).
The distribution of the minimum $\min_{h \in x} \text{Exp}[w(h)]$ is
$\text{Exp}[\sum_{h \in x} w(h)] = \text{Exp}[w_x]$ (the minimum
of independent exponentially distributed random variables
with sum of parameters $w_x$ is exponentially distributed with
parameter $w_x$). So we obtain that if y ~ $\text{Exp}[w_x]$ is such
that $y < 1/\ell$, which happens with probability $1 - e^{-w_x/\ell}$, the
seed is KeyBase(x). Otherwise, seed(x) = y. We now use
the memoryless property of the exponential distribution, which
implies that the conditional distribution of y given that
$y > 1/\ell$ is $1/\ell + \text{Exp}[w_x]$. So with probability $e^{-w_x/\ell}$ the
distribution is $1/\ell + \text{Exp}[w_x]$. □

Note that the element scoring satisfies our requirement
(Section 3.4) that the distribution of seed(x) depends only
on $w_x$. Qualitatively, when $w_x \ll \ell$ the seed is close to
exponentially distributed with parameter $w_x$, which is ppswor.
When $w_x \gg \ell$, the seed is uniform, which results in distinct
sampling. We obtain the property that the sampling probability
of a key x is roughly proportional to $\mathrm{cap}_\ell(w_x)$, which is needed
to approach the "gold standard" CV.


5.1 2-pass estimator 

Consider 2-pass sampling (Section HD with our element 
scoring function (10). For estimation, we need to compute
the probability $\Phi_{\tau,\ell}(w_x) = \Pr[\text{seed}(x) < \tau]$ of a key
with weight $w_x$ in a sample with parameters $\ell$ and $\tau$. If
$\tau\ell < 1$, then a key is included if $\text{Exp}[w_x] < 1/\ell$ and then
KeyBase(x) < $\tau$. These two events are independent and
have joint probability $(1 - e^{-w_x/\ell})\tau\ell$. If $\tau\ell \ge 1$ then a key is
included if $\text{Exp}[w_x] < \tau$, which has probability $(1 - e^{-w_x \tau})$.

We can express the combined probability as

$$\Phi_{\tau,\ell}(w_x) = \left(1 - e^{-w_x \max\{1/\ell,\, \tau\}}\right) \min\{1, \tau\ell\} . \tag{11}$$

We can then apply inverse probability to estimate a
segment statistics

$$\hat{Q}(f,H) = \sum_{x \in S \cap H} \frac{f(w_x)}{\Phi_{\tau,\ell}(w_x)} .$$
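The inclusion probability (11) and the 2-pass inverse probability estimate are simple to evaluate. The sketch below is illustrative; the function names and the dictionary representation of the sample (sampled key in the segment mapped to its exact weight) are assumptions.

```python
import math

def inclusion_probability(w, tau, ell):
    """Equation (11): Pr[seed(x) < tau] for a key of weight w."""
    return (1.0 - math.exp(-w * max(1.0 / ell, tau))) * min(1.0, tau * ell)

def two_pass_estimate(sample, tau, ell, f):
    """Inverse-probability estimate; sample maps sampled keys in H to exact weights."""
    return sum(f(w) / inclusion_probability(w, tau, ell) for w in sample.values())
```

Two limits are worth noting when $\tau\ell < 1$: for very light keys the probability is close to $w\tau$ (proportional to weight), and for very heavy keys it saturates at $\tau\ell$, matching the capped behaviour described above.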


We show the following (proof provided in Appendix D.1):

Theorem 5.1. The CV of estimating $Q(\mathrm{cap}_T, H)$ from an $\mathrm{SH}_\ell$
sample which provides exact weights $w_x$ ($x \in S$) is at most

$$\left( \frac{e}{e-1}\, \frac{\max\{T/\ell, \ell/T\}}{q(k-1)} \right)^{0.5} .$$

When $\ell = T$, we obtain a bound of at most 1.26 times
the CV bound of $(q(k-1))^{-0.5}$ we can obtain for samples
computed over the aggregated data. When
$\ell = \Theta(T)$, the CV is $O((q(k-1))^{-0.5})$ and the upper bound
degrades smoothly with the disparity $\max\{T/\ell, \ell/T\}$ between
$\ell$ and T. Also note that the increased CV due to the constant
e/(e − 1) and the disparity arise from a worst-case analysis
and are not inherent.

Figure 2 shows the relative inclusion probabilities as a
function of the weight $w_x$ for $\mathrm{SH}_{10}$, and pps and ppswor with
respect to $\mathrm{cap}_{10}(w_x)$. The "gap" between the ratios for $\mathrm{SH}_{10}$
and for the pps/ppswor (which are realizable on aggregated
data) illustrates our loss relative to the "gold standard" CV.
We can see that the gap is larger for weights that are around
the cap parameter of 10, and maximizes at the ratio (1 − 1/e).
In this sense, data with many keys with weight close to the
cap thresholds are the "worst case" for the variance.



Fig. 2. (Relative) inclusion probabilities as a function of
the weight $w_x$ for $\mathrm{SH}_{10}$, pps, and ppswor, computed with
respect to $\mathrm{cap}_{10}$. The y axis shows the inclusion probability
as a fraction of the maximum inclusion probability. For
all schemes we normalized the threshold to be such that
the inclusion probability maximizes at 0.01.


5.2 1-pass algorithms 

The streaming (1-pass) algorithms compute a sample S of
cached keys and a value $c_x \le w_x$ for each $x \in S$.

Algorithm 4 performs fixed threshold sampling. When processing
an element h = (x, w) with a cached key, we update
$c_x \leftarrow c_x + w$. Otherwise, we compute the weight $\Delta$ that
would be needed for the score to be below $\max\{1/\ell, \tau\}$.
If $w < \Delta$, we break. Otherwise, if $\tau < 1/\ell$, we break if
KeyBase(x) > $\tau$. Finally, we initialize a counter $c_x \leftarrow w - \Delta$.
Intuitively, the score is continuously assigned to the mass $w_x$.
The value $c_x \le w_x$ is the weight observed after the point in
which the score gets below $\tau$.


Algorithm 4: Continuous $\mathrm{SH}_\ell$ stream sampling: fixed $\tau$
Data: threshold $\tau$, stream of elements (x, w) where
      $x \in \mathcal{X}$ and w > 0
Output: set of pairs $(x, c_x)$ where $x \in \mathcal{X}$ and
      $c_x \in (0, w_x]$

Counters ← ∅   // initialize Counters cache
foreach stream element (x, w) do   // Process a stream element
    if x is in Counters then
        Counters[x] ← Counters[x] + w
    else
        $\Delta \leftarrow -\ln(1 - \text{rand}()) / \max\{\tau, 1/\ell\}$   // $\Delta \sim \text{Exp}[\max\{\tau, 1/\ell\}]$
        if KeyBase(x) < $\min\{\tau, 1/\ell\}$ and $\Delta < w$ then
            // initialize counter for x
            Counters[x] ← $w - \Delta$
return ((x, Counters[x]) for x in Counters)
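A direct Python transcription of the fixed-threshold procedure might look as follows. It is a sketch under stated assumptions: `key_base` is a hypothetical SHA-256 based stand-in for KeyBase(x), and $\Delta$ is drawn by inverse transform as $-\ln(1-u)/\max\{\tau, 1/\ell\}$ with u uniform in [0, 1).

```python
import hashlib
import math
import random

def key_base(x, ell, salt=""):
    """KeyBase(x) = Hash(x) / ell with a pseudo-uniform hash (hypothetical helper)."""
    digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
    return ((int.from_bytes(digest[:8], "big") >> 11) / 2**53) / ell

def continuous_sh_fixed_tau(stream, tau, ell, rng, salt=""):
    """Fixed-threshold continuous SH_ell over a stream of (key, weight) pairs."""
    counters = {}
    rate = max(tau, 1.0 / ell)
    for x, w in stream:
        if x in counters:
            counters[x] += w                 # cached key: add the full weight
            continue
        delta = -math.log(1.0 - rng.random()) / rate   # delta ~ Exp[rate]
        if key_base(x, ell, salt) < min(tau, 1.0 / ell) and delta < w:
            counters[x] = w - delta          # weight seen after the entry point
    return counters
```

With a very large threshold every key enters almost immediately and the counters are close to the exact weights; with threshold zero nothing is sampled.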
Fixed sample size sampling is provided as Algorithm 5. To
maintain a fixed size sample, the threshold is decreased when
there are k + 1 cached keys, to the point needed to evict a key.
The algorithm "simulates" the end result of working with the
lower threshold to begin with.

The eviction step is as follows. We draw and fix some
"randomization" and compute for each cached key the threshold
needed to evict the key. The randomization for key x is
in the form of $u_x$ and $r_x$. We then compute $z_x$, which is the
maximum threshold value that is needed to evict x with respect
to that randomization. We then take the new threshold to be the
maximum $z_x$ over keys. One key (the one with maximum $z_x$)
is evicted. For remaining keys, $c_x$ (Counters[x]) is updated
according to the same $u_x, r_x$.

We elaborate on how $z_x$ is determined when the current
threshold is $\tau$. The key x can be viewed as having a score
(computed to the point the key entered the cache) that is at
most $\tau$. We can consider the distribution of the score given that
it is at most $\tau$; with the randomization, we can take it as $u_x\tau$.
A necessary requirement for x to be evicted is that the new
threshold $\tau^*$ is below $u_x\tau$, so we have $z_x \le u_x\tau$. Conditioned
on $\tau^* < u_x\tau$, we can treat this as processing an element
with a new (uncached) key x and weight $c_x$. We consider the
threshold value $\tau^*$ needed for the key to enter the cache. We
simply reverse the entry rule: If $-\ln(1 - r_x)/c_x > 1/\ell$, then
the key would enter the cache when $\tau^* > -\ln(1 - r_x)/c_x$. If
$-\ln(1 - r_x)/c_x < 1/\ell$, then the key would enter the cache if
and only if $\tau^* >$ KeyBase(x), with count
$c_x - \ell(-\ln(1 - r_x))$.

We now express the distribution of $c_x$ (Counters[x]) and
verify that it satisfies our requirement that for any key x, it
only depends on $w_x$, $\ell$, and $\tau$. Recall that for fixed-threshold
$\mathrm{SH}_\ell$ we use the specified $\tau$, whereas with fixed-cache size $\mathrm{SH}_\ell$,
the statement is conditioned on the randomization on all other
keys, which determines $\tau$ when $x \in S$. The proof is provided
in the Appendix.

Theorem 5.2. With fixed-$\tau$ $\mathrm{SH}_\ell$ (Algorithm 4) and fixed-k $\mathrm{SH}_\ell$
(Algorithm 5), for any key x, $c_x \sim \max\{0, w_x - \phi\}$, where $\phi$
has density

$$\phi(y) = \tau \exp(-y \max\{1/\ell, \tau\})$$

in the interval $y \in [0, w_x]$.

Optimization comment: Batch evictions 

Each decrease of the threshold involves scanning all keys in
the cache. The expected total number of evictions, however, is
at most k ln m, where m is the number of elements. To reduce
amortized eviction cost, we can use a slight modification of
the algorithm which evicts $\delta$ keys when the cache is full,
where $\delta$ is a fraction of k. The modification uses the $\delta$-th
largest $z_x$ instead of the maximum one as the new $\tau^*$. All
keys y with $z_y \ge \tau^*$ are then evicted. The new threshold is $\tau^*$.
The computation of the estimators, which we present next, is
with respect to the current threshold and is the same with the
batched evictions or one-at-a-time eviction.


Algorithm 5: Continuous $\mathrm{SH}_\ell$ stream sampling: fixed k
Data: sample size k, stream of elements of the form
      (x, w) with key $x \in \mathcal{X}$ and w > 0
Output: $\tau$; set of pairs $(x, c_x)$ where $x \in \mathcal{X}$ and
      $c_x \in (0, w_x]$

Counters ← ∅; $\tau \leftarrow \infty$   // initialize cache
foreach stream element (x, w) do   // Process element
    if x is in Counters then
        Counters[x] ← Counters[x] + w
    else
        $\Delta \sim \text{Exp}[\max\{\ell^{-1}, \tau\}]$
        if $\Delta < w$ and ($\tau\ell > 1$ or ($\tau\ell \le 1$ and KeyBase(x) < $\tau$)) then   // insert x
            Counters[x] ← $w - \Delta$
            if |Counters| = k + 1 then   // Evict a key
                if $\tau\ell > 1$ then
                    foreach x ∈ Counters do
                        $u_x \leftarrow$ rand(); $r_x \leftarrow$ rand()
                        $z_x \leftarrow \min\{\tau u_x, -\ln(1 - r_x)/\text{Counters}[x]\}$   // eviction threshold of x
                        if $z_x < 1/\ell$ then
                            $z_x \leftarrow$ KeyBase(x)
                    $y \leftarrow \arg\max_{x \in \text{Counters}} z_x$; Delete y from Counters   // key to evict
                    $\tau^* \leftarrow z_y$   // new threshold
                    foreach x ∈ Counters do   // Adjust counters according to $\tau^*$
                        if $u_x > \max\{\tau^*, \ell^{-1}\}/\tau$ then
                            // re-entry with weight Counters[x], as described in the text
                            Counters[x] ← Counters[x] $+ \ln(1 - r_x)/\max\{\ell^{-1}, \tau^*\}$
                    $\tau \leftarrow \tau^*$; delete u, r, z, b   // deallocate memory
                else   // $\tau\ell \le 1$
                    $y \leftarrow \arg\max_{x \in \text{Counters}}$ KeyBase(x); Delete y from Counters   // evict
                    $\tau \leftarrow$ KeyBase(y)   // new threshold
return ($\tau$; (x, Counters[x]) for x in Counters)


5.3 Estimators for Continuous SH^ 

We seek an unbiased and nonnegative estimator in a coefficient
form, that is, a function $\beta^{(f,\tau,\ell)}(c)$ defined for any c > 0, and
we use the estimator

$$\hat{Q}(f,H) = \sum_{x \in H \cap S} \beta^{(f,\tau,\ell)}(c_x) . \tag{12}$$

Theorem 5.3. For any continuous f that is differentiable
almost everywhere, the estimator that uses

$$\beta^{(f,\tau,\ell)}(c) = f(c)/\min\{1, \ell\tau\} + f'(c)/\tau \tag{13}$$

is unbiased.


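Proof: We separately treat the cases where $\tau\ell < 1$ and
$\tau\ell \ge 1$. We first show that when $\tau\ell \ge 1$, $\beta(c) = f(c) + f'(c)/\tau$
are unbiased estimation coefficients.

For a key of size w, we have density $\tau e^{-\tau x}$ to have count
of $w - x \in (0, w)$ (otherwise the key has count 0 and the
estimate is 0). We can write

$$\beta(y) = \left(f(y) e^{\tau y}\right)' e^{-\tau y} \tau^{-1} .$$

Consider a key of size w. Its expected contribution to the
estimate is

$$\int_0^w \tau e^{-\tau x} \beta(w - x)\, dx
  = \int_0^w \tau e^{-\tau x} \left(f(w-x) e^{\tau(w-x)}\right)' e^{-\tau(w-x)} \tau^{-1}\, dx$$
$$= e^{-\tau w} \int_0^w \left(f(w-x) e^{\tau(w-x)}\right)' dx
  = e^{-\tau w} f(w) e^{\tau w} = f(w) ,$$

where the prime denotes the derivative of $y \mapsto f(y) e^{\tau y}$ evaluated at $y = w - x$.

We now consider the case where $\tau < 1/\ell$, showing that

$$\beta(c) = f(c)/(\ell\tau) + f'(c)/\tau$$

are unbiased estimation coefficients. For a key with weight w,
we have density $\tau e^{-x/\ell}$ to have count of $w - x \in (0, w)$. We
write

$$\beta(y) = \left(f(y) e^{y/\ell}\right)' e^{-y/\ell} \tau^{-1} .$$

The expected contribution of the key is

$$\int_0^w \tau e^{-x/\ell} \beta(w - x)\, dx
  = e^{-w/\ell} \int_0^w \left(f(w-x) e^{(w-x)/\ell}\right)' dx = f(w) .$$

□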

Note that any continuous monotone function, including the
$\mathrm{cap}_T$ functions, is differentiable almost everywhere and hence
satisfies the requirements of the theorem.
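For $f = \mathrm{cap}_T$ the coefficient function (13) has a simple closed form, and unbiasedness can be checked numerically against the count distribution of Theorem 5.2. The sketch below is illustrative: the function names are assumptions, and the expectation is approximated with a plain midpoint rule rather than computed in closed form.

```python
import math

def beta_cap(c, T, tau, ell):
    """Coefficient (13) for f = cap_T: f(c) = min(T, c), f'(c) = 1 for c < T."""
    return min(T, c) / min(1.0, ell * tau) + (1.0 / tau if c < T else 0.0)

def expected_estimate(w, T, tau, ell, steps=200_000):
    """Midpoint-rule integral of density(y) * beta(w - y) over y in (0, w),
    with density(y) = tau * exp(-y * max(1/ell, tau)) as in Theorem 5.2."""
    rate = max(1.0 / ell, tau)
    h = w / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        total += tau * math.exp(-y * rate) * beta_cap(w - y, T, tau, ell)
    return total * h
```

For a key of weight w the integral should return $\min\{T, w\}$ up to discretization error, in both the $\tau\ell < 1$ and the $\tau\ell \ge 1$ regimes.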

We upper bound the CV of the streaming fixed-k $\mathrm{SH}_\ell$
estimator (proof is in Appendix D.2):

Theorem 5.4. The CV of estimating $Q(\mathrm{cap}_T, H)$ from an $\mathrm{SH}_\ell$
sample is upper bounded by

$$\left( \frac{e}{e-1}\, \frac{1 + \max\{\ell/T, T/\ell\}}{q(k-1)} \right)^{0.5} .$$

In particular, when $\ell = \Theta(T)$, the CV is $O((q(k-1))^{-0.5})$,
and when $\ell = T$, the CV is at most

$$\left( \frac{2e}{e-1}\, \frac{1}{q(k-1)} \right)^{0.5} \le 1.8\, (q(k-1))^{-0.5} .$$


6 Multi-objective samples 

We established that from a fixed-k $\mathrm{SH}_\ell$ sample we can estimate
well $\mathrm{cap}_T$ statistics when $T = \Theta(\ell)$. This means that if we are
interested in estimates with statistical guarantees for cap values
$T \in [a, b]$, it suffices to use $\mathrm{SH}_\ell$ samples with parameters
$\ell_i = 2^i a$ for $i \le \lceil \log(b/a) \rceil$. To process a query for a $\mathrm{cap}_T$
statistics, we can use the $\mathrm{SH}_\ell$ sample with $\ell$ that is closest
(within a factor of $\sqrt{2}$) to T. In particular, to estimate
all cap statistics, it suffices to use $\lceil \log(\max_x w_x / \min_x w_x) \rceil$
samples.
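The basic multi-sample approach amounts to a geometric grid of cap parameters and a nearest-neighbour lookup in ratio. The helper names below are illustrative, and the base-2 logarithm is an assumption consistent with the doubling grid.

```python
import math

def cap_grid(a, b):
    """Cap parameters ell_i = 2^i * a for i = 0..ceil(log2(b/a))."""
    return [a * 2 ** i for i in range(math.ceil(math.log2(b / a)) + 1)]

def closest_ell(T, grid):
    """Grid value minimising the disparity max(ell/T, T/ell)."""
    return min(grid, key=lambda ell: max(ell / T, T / ell))
```

Since consecutive grid values differ by a factor of 2, any T in [a, b] is within a factor of $\sqrt{2}$ of some grid value, so the disparity term in the CV bounds stays bounded by a constant.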

We now improve over this basic approach: instead of
working with a set $\{S_\ell\}$ of samples with respective caps
$\ell \in L$, we work with a single sample $S_L = \bigcup_{\ell \in L} S_\ell$. The
improvement has several components: sample coordination,
which ensures that samples with closer $\ell$ are more similar
so that $|S_L| \ll k|L|$; using estimators that benefit from the
combined sample; and sampling algorithms that use state that
is proportional to $|S_L|$.

6.1 Sample coordination 

We coordinate the samples for different $\ell$ by using
the same "randomization." In our context, the randomization
of each key consists of two independent random variables
Hash(x) ~ U[0, 1] and $y_x \sim \text{Exp}[w_x]$, which is the minimum
over elements with key x of the Exp[w] component used for
scoring elements (x, w). With coordination, we can express
the seed of x as a function of $\ell$ as:

$$\text{seed}_\ell(x) = y_x \le 1/\ell \;?\; \text{Hash}(x)/\ell : y_x .$$

The sample $S_\ell$ includes the k keys with smallest $\text{seed}_\ell$ values
and its threshold $\tau_\ell$ is the (k + 1)st smallest $\text{seed}_\ell$ value (or
$+\infty$ if there are fewer than k + 1 active keys).
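Coordination can be illustrated with a few lines of code operating on the per-key randomization directly. The representation below (a dictionary from key to the pair (y, h)) and the function names are assumptions made for the illustration.

```python
def seed(y, h, ell):
    """Coordinated seed for shared randomization y ~ Exp[w_x], h ~ U[0,1]."""
    return h / ell if y <= 1.0 / ell else y

def coordinated_sample(randomization, ell, k):
    """randomization maps key -> (y, h). Returns (sampled keys, threshold)."""
    ranked = sorted(randomization, key=lambda x: seed(*randomization[x], ell))
    if len(ranked) > k:
        threshold = seed(*randomization[ranked[k]], ell)   # (k+1)st smallest seed
    else:
        threshold = float("inf")
    return set(ranked[:k]), threshold
```

At the two extremes the samples are familiar: for very small $\ell$ every seed is Hash(x)/$\ell$ and the sample holds the k smallest hashes (distinct sampling), while for very large $\ell$ every seed is $y_x$ and the sample holds the k smallest $y_x$ (ppswor).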

Surprisingly perhaps, we show that the expected number of
distinct keys in $S_L$ for L = (0, ∞) when the samples $S_\ell$ are
coordinated is at most k ln n, where n is the number of active
keys in the data set. In particular, this upper bounds $|S_L|$ for
any set of cap parameters L.

Lemma 6.1. Let $\{S_\ell\}$ for $\ell \in L = (0, \infty)$ be coordinated
fixed-k $\mathrm{SH}_\ell$ samples. Then $E[|S_L|] = E[|\bigcup_{\ell \in L} S_\ell|] \le k \ln n$.

Moreover, for a > 1, the probability of $|S_L| \ge a k \ln n$
decreases exponentially with a.

Proof: Consider an order of all keys by increasing $y_x$.
Any sample $S_\ell$ must have the form of the keys with k smallest
Hash(x) values in a prefix of this order. We now consider
the i-th key in this order and the probability that it qualifies
for some sample, which is the probability that Hash(x) is
among the k smallest in the prefix of the first i keys. Since
the random variables Hash(x) are independent of $y_x$ and
unrelated to the order, this is exactly the probability that the
key is in one of the first k positions in a random permutation
of size i, which is min{1, k/i}. Summing over all i we obtain
the claim. Concentration follows from Chernoff bounds. □
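The sum used in the proof is easy to evaluate and compare with the k ln n bound. The function name is an illustrative assumption; the comparison in the usage below is for one parameter setting, not a general claim.

```python
import math

def expected_union_size(n, k):
    """Sum over ranks i = 1..n of min(1, k/i), as in the proof of Lemma 6.1."""
    return sum(min(1.0, k / i) for i in range(1, n + 1))
```

For example, with n = 10,000 active keys and k = 100, the sum is about 560, compared with k ln n of about 921 and with k|L| for a grid of even a handful of separate samples.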

A corollary of the proof is that the property $x \in S_\ell$ holds
for a contiguous interval of $\ell$ values that generally has the
form $(1/y_z, 1/y_x]$ for some key z. If $y_x$ is amongst the k
smallest among $\{y_z \mid z \in \mathcal{X}\}$ then the interval has the form
$(1/y_z, +\infty)$, and if Hash(x) is amongst the k smallest in
$\{\text{Hash}(z) \mid z \in \mathcal{X}\}$ then the interval is $(0, 1/y_x]$.


6.2 Estimation 

We now consider estimators that leverage all the sampled keys
$x \in S_L = \bigcup_{\ell \in L} S_\ell$. This allows us to obtain tighter estimates than
when using any one sample $S_\ell$. We consider 2-pass sampling,
so that $w_x$ is available for sampled keys, and compute for each
key x a probability $\Phi(w_x)$ that it is included in at least one
of $S_\ell$ for $\ell \in L$, when fixing the randomization on $\mathcal{X} \setminus \{x\}$.
Once we have the probabilities $\Phi(w_x)$, we can apply an
inverse probability estimate $\hat{Q}(f,H) = \sum_{x \in H \cap S_L} f(w_x)/\Phi(w_x)$.
Since $\Phi(w_x)$ is at least as large as the inclusion probability
of x in any individual $S_\ell$, the variance of the estimate for any
query is at most that obtained by using any single sample.

We now elaborate on computing the probabilities $\Phi$. This
probability can be computed from $w_x$ and the set of pairs
{£, for £ G L. Here, we define to be the k* smallest 
seedf(z) for z G A’\ {a;}. When x G Si, this is the threshold 
Ti and otherwise, it is seed^(z). 


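Lemma 6.2.

$$\Phi(x) = \Pr_{y \sim \text{Exp}[w_x],\; h \sim U[0,1]} \left[ \exists \ell \in L,\; C(\ell, x) \right] ,$$

where $C(\ell, x)$ is the condition $y < \max\{\tau_\ell, 1/\ell\}$ and $h < \ell\tau_\ell$.

Proof: Using the independent random variables $y_x \sim \text{Exp}[w_x]$
and Hash(x), the condition for inclusion of x in
$S_\ell$ is that

$$y_x < \max\{\tau_\ell, 1/\ell\} \quad \text{and} \quad \text{Hash}(x) < \ell\tau_\ell .$$

The condition for inclusion in $S_L$ is that $(y_x, \text{Hash}(x))$
satisfy the condition for at least one $\ell \in L$. □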

When working with L that contains a contiguous interval,
such as L = (0, ∞), we can express the k-th and (k + 1)st
smallest values in $\text{seed}_\ell(z)$ as a function of $1/\ell$ as a piecewise
linear function with at most k ln n pieces (in expectation).
This allows us to compute $\Phi(w_x)$ for all keys $x \in S_L$.


6.3 Sampling algorithm

We can engineer the sampling algorithm so that it maintains
state that is proportional to the number of distinct keys in $S_L$.
Let $\ell_i$ be our list of cap parameters in decreasing order. The
algorithm maintains, for each i, the k + 1 keys with smallest
Hash(x) amongst those with $y_x < 1/\ell_i$. If for the highest $\ell$
values in L there are fewer than k + 1 keys with $y_x < 1/\ell$,
we include the k + 1 keys with smallest $y_x$.

7 Simulations

Our experimental evaluation is aimed to understand the error
distribution of our estimators. Our analysis provided statistical
guarantees on the errors that are close to the "gold standard"
attainable on aggregated data. The analysis, however,
is worst-case in terms of the dependence on the disparity
$\max\{\ell/T, T/\ell\}$ and the factors of $(e/(e-1))^{0.5} \approx 1.26$ (2-pass)
and $(2e/(e-1))^{0.5} \approx 1.8$ (1-pass), which assume a worst-case
frequency distribution (error is larger when $w_x \approx \ell$), and
does not reflect the advantage of without-replacement sampling, which
is significant when there is skew. We therefore expect actual
errors to be much lower than our upper bounds.

Our sampling algorithms and estimators were implemented
in Python using the numpy.random and hashlib libraries. Simulations
were performed on MacBook Air and Mac mini
computers. We did not attempt to benchmark performance in
terms of running time, since computationally, our algorithms
are similar to the widely applied distinct sampling or counting
algorithms and can easily be tuned and scaled to very large
data sets and common platforms.

We generated streams of $10^5$ elements with uniform
weights. The keys were drawn from a Zipf distribution with
parameter $\alpha$ = 1.1, 1.2, 1.5, 1.8, 2. This range of Zipf parameters
is typical of large data sets, and working with them allowed
us to finely understand the error dependence on the skew (Zipf
with larger $\alpha$ is more skewed and has fewer distinct keys per
number of elements). The average number of distinct keys
in our simulations, and respective sample sizes we used, was
$4.3 \times 10^4$ for $\alpha = 1.1$ (used k = 100); $1.84 \times 10^4$ for $\alpha = 1.2$
(used k = 100); $3.04 \times 10^3$ for $\alpha = 1.5$ (used k = 100); 841
for $\alpha = 1.8$ (used k = 50); and 437 for $\alpha = 2$ (used k = 50).

For each stream, we computed the exact frequencies
of each key for reference in the error computation of
the estimates. For a set of sample cap parameters $\ell$ =
1, 5, 20, 50, 100, 1000, 10000 (and also $\ell = 0.1$ with continuous
samples), we computed discrete and continuous fixed-k
$\mathrm{SH}_\ell$ samples. Discrete $\mathrm{SH}_\ell$ sampling used the discrete fixed-k
algorithm with scoring function (6), and continuous $\mathrm{SH}_\ell$ sampling used
Algorithm 5.

From each sample, we computed an estimate of the frequency
cap statistic $Q(\mathrm{cap}_T, \mathcal{X})$ over all keys, for parameters
T = 1, 5, 20, 50, 100, 1000, 10000. With discrete $\mathrm{SH}_\ell$, we
used the coefficient-form estimator of Section 4.1 and computed estimation
coefficients as in Theorem 4.1. With continuous $\mathrm{SH}_\ell$, we used
the estimator (12) with coefficient function (13), which for
$\mathrm{cap}_T$ statistics is:

$$\beta(c) = \frac{\min\{T, c\}}{\min\{1, \ell\tau\}} + \tau^{-1} I_{c < T} .$$


For each £, T combination, we also computed the estimate 
that is obtained from 2-pass algorithms (Section |3T), applied 
with element scoring (6) for discrete schemes and (10) for
continuous schemes. We used the inverse probability estimate
$\sum_{x \in S} \min\{T, w_x\}/\Phi(w_x)$, where $\Phi$ is (11) for continuous
schemes and as outlined in Section 4 for discrete schemes.

For each of these estimates, we computed the relative
and NRMSE errors, averaged over multiple (rep = 200 or
rep = 500) simulations (each using a fresh hash function and
randomness). Selected simulation results showing the errors
for $\ell$, T combinations are provided in Figure 3 for discrete
$\mathrm{SH}_\ell$ and in Figure 4 for continuous $\mathrm{SH}_\ell$. The minimum error
for each statistics T across samples $\ell$ is boldfaced.
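The text does not spell out the error definitions, so the sketch below assumes the usual ones: NRMSE as the root mean squared error over repetitions divided by the true value, and relative error as the mean absolute deviation divided by the true value. The function names are illustrative.

```python
import math

def nrmse(estimates, truth):
    """Root mean squared error of the estimates, normalised by the true value."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth

def mean_relative_error(estimates, truth):
    """Mean absolute deviation of the estimates, normalised by the true value."""
    return sum(abs(e - truth) for e in estimates) / (len(estimates) * truth)
```

For an unbiased estimator the NRMSE over many repetitions approximates the CV, which is what makes it directly comparable with the upper bounds of Theorems 5.1 and 5.4.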
Fig. 3. Simulation Results for Discrete $\mathrm{SH}_\ell$


Discussion of results

When looking at the parameter $\ell$ with smallest error for each
cap statistics T, we see the diagonal pattern expected from
our analysis, where the error is minimized when $\ell \approx T$
and degrades with disparity between T and $\ell$. Note that the
smallest distinct sampling threshold we had was $\tau \approx 0.001$
(for $\alpha = 1.1$); therefore, our high $\ell$ values effectively emulated
uncapped SH.

Even for these realistic distributions, we observe a
considerable performance gain by using an appropriate sample
for our particular cap statistics. We can also see that the
sensitivity of the error to the parameter $\ell$ increases with skew
(higher Zipf parameter $\alpha$). In particular, the ratio of the error
to the boldfaced minimum when using a high $\ell$ sample to
estimate distinct counts was up to a factor of 3, whereas the
reverse could be 30-fold or more. The increase in error for
mid-cap statistics by using the better one of $\ell = 1, \infty$ instead
of the minimum was up to 40%. Note however that even this
is optimistic, as we measured error on the whole population:
on segments with frequency distributions that do not match
that of the population, error can be much higher (see footnote 3).

Fig. 4. Simulation Results for Continuous $\mathrm{SH}_\ell$

Comparing the error of 2-pass versus streaming estimates
(both are the same for distinct counts $\ell = 1$ but diverge
otherwise), we observe that the benefit of the second pass is
limited to 10% and typically lower. This agrees with our CV
upper bounds, which are only slightly larger for the streaming
estimates. This suggests that the choice of scheme should
depend on the computational platform.

The £ = 1,T = 1 estimates have NRMSE « 1/v^fc^, 
this is because the upper bounds for approximate distinct
counting are fairly tight, as there is no dependence
on the frequency distribution. In our simulations, for higher
cap values T, the minimum error (over $\ell$) was typically much
lower than the CV upper bounds. This suggests using adaptive
confidence bounds, based on sampled frequencies, rather than
relying only on the CV upper bounds.


8 Related Work 

There is a large body of work on computing statistics over
unaggregated data, which we cannot hope to cover here. The
toolbox includes deterministic algorithms 1291 . other sampling 
algorithms la, and Linear sketches (random linear projections) 
Eg), m, m, ca Deterministic algorithms work well 
for approximate heavy hitters and quantiles. Linear sketches 
project the key-weight vectors to a lower dimensional vector. 
Linearity implies efficient updates of the sketch when process¬ 
ing elements in a streaming setting. Most related to frequency 
cap statistics are sketches based on p-stable distributions that 
are designed to estimate frequency moments for p G [0,1] 
l25l and Lp sampling in, Ez). These techniques do not 
apply for cap statistics, as there are no appropriate stable 
distributions for cap functions. They are also specihc to the 
choice of p and there is no support for segment queries. Lp 
samples, which sample keys roughly proportionally to w^, are 
with-replacement, so less effective for skewed data, and have 
polylogarithmic encoding overhead. Of relevance to us is also 
a characterization of all monotone frequency statistics that can 
be estimated in polylogarithmic space and a single pass El- 
The construction, however, is mostly of theoretical interest. 
Generally, linear sketches have a signihcant encoding overhead 
and in practice, when updates are positive, are outperformed 
by sample-based sketches. In particular, all practical distinct 
counting algorithms are based on the sample-based MinHash 
sketches ifTSll . ifTTll . Il22l . 161 and for sum queries, weighted SH 
experimentally dominated linear sketches even in the presence 
of some negative updates El- 

3. To make this point clearer, our selected segment was the whole
population, which means that for the segment, the number of samples was
the same as when sampling using $\ell = T$. Estimation quality deterioration
from disparity was only due to the allocation of sampling probabilities within
the segment. We can expect worse results (but again, theory bounds the worst case
pretty tightly) when adversarially selecting segments. An example is sampling
a skewed distribution with very large $\ell$ and choosing segments of keys with small
weight $w_x = 1$, with $T = 1$. In this case, the segment can have a high fraction of
distinct keys but a small fraction of total weight and will obtain very few
samples.


9 Conclusion

Frequency cap statistics are fundamental to data analysis.
We propose a principled and practical sampling solution for
scalably and accurately estimating frequency cap statistics
over unaggregated data sets. The sample is computed using
state proportional to the specified desired sample size, and the
estimates have error bounds that nearly match those that can be
obtained by an optimal weighted sample of the same size, which
can only be computed over the aggregated view. Our design
brings the benefits of approximate distinct counters, which are
extensively deployed in the industry, to general frequency cap
statistics.

Looking ahead, we would like to apply our framework for
sampling unaggregated data sets to other statistics, extend
it to support negative updates [?], [?], and understand the
theoretical boundaries of the approach.

Appendix A
Count distribution


We provide the proof of Theorem 5.2.

Proof: We first consider fixed-$\tau$ SH$_\ell$. We use $\Phi(w)$,
which is the probability that a key of weight $w$ is sampled.
This is the same as the probability of a key with $w_x \ge w$
starting to get counted after processing $y \le w$ of its weight.
The derivative of $\Phi(w)$ with respect to $w$ is the density
function $\phi(y)$ of the weight $y$ at which a key starts getting
counted:

$$\phi(w) = \frac{d\Phi(w)}{dw} = \min\{1, \tau\ell\}\,\max\{1/\ell, \tau\}\,\exp(-w \max\{1/\ell, \tau\}) = \tau \exp(-w \max\{1/\ell, \tau\}) .$$

For a particular key $x$, the density of the weight $y$ at which $x$
starts getting counted is equal to $\phi(y)$ in the range $y \in [0, w_x]$.
The remaining probability mass,
$1 - \int_0^{w_x} \phi(y)\,dy = 1 - \Phi(w_x)$, is the probability that $x$ is not
sampled.

We now establish our claim for the fixed-size sampling algorithm.
We start with a precise definition of the conditioning we
use. The randomization used for a key $x$ includes the random
hash value used in KeyBase(x), the randomization used to
assign scores to all the elements of $x$, and the random $u_x, z_x$
(freshly drawn per eviction step) used to adjust the counters.

Observe that given that a key $x$ is cached, the threshold value
$\tau$ only depends on the randomization of the other keys but
not on that of $x$. When $x$ is not cached, the threshold value
$\tau$ may depend on the randomization used for $x$. But in this
case, $c_x = 0$.

We show that after each step, the distribution of the final
value of $c_x$ has the claimed density, when conditioned on
the current threshold $\tau$. Correctness when $\tau$ does not change
follows from the treatment of the fixed-threshold case. It
remains to consider eviction steps. Let $\tau$ be the threshold value
before eviction and let $\tau^*$ be the new threshold value after
eviction. If the new $\tau^*$ is determined by our key $x$, then our
key $x$ is evicted. Otherwise, the new $\tau^*$ is determined by the
$k$th largest $z_x$ among all other keys. This value depends only
on the randomization of other keys and not on $u_x, r_x$.

We now need to show that the particular computation we
used for the count adjustment preserves the claimed form of
the distribution. For that we assume that the distribution of
$c_x$ was as claimed with respect to the initial threshold $\tau$. We
express it as a function of the final threshold $\tau^*$.

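We first consider the case $\tau\ell \ge 1$ and $\tau^*\ell \ge 1$. With
probability $\tau^*/\tau$ the density at $y$ is the same, and with
probability $(1 - \tau^*/\tau)$ it is the integral over $u$ of the density
with $\tau$ at $u \le y$ and the density of a new deduction, which
is Exp$(\tau^*)$, at $y - u$. Now observe that the density of the
deduction conditioned on $\tau$ before the adjustment was (by
our assumption) $\tau e^{-\tau u}$. We obtain

$$\frac{\tau^*}{\tau}\,\tau e^{-\tau y} + \Bigl(1 - \frac{\tau^*}{\tau}\Bigr)\int_0^y \tau e^{-\tau u}\,\tau^* e^{-\tau^*(y-u)}\,du$$
$$= \tau^* e^{-\tau y} + \tau^*(\tau - \tau^*)\,e^{-\tau^* y}\int_0^y e^{-u(\tau - \tau^*)}\,du$$
$$= \tau^* e^{-\tau y} + \tau^* e^{-\tau^* y}\bigl(1 - e^{-y(\tau - \tau^*)}\bigr) = \tau^* e^{-\tau^* y} .$$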

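We now consider the case that $\tau\ell \ge 1$ and $\tau^*\ell < 1$. With
probability $1 - \tau^*\ell$ we have KeyBase(x) $> \tau^*$ and the final
count is 0. With probability $\tau^*\ell \cdot \frac{1}{\ell\tau} = \tau^*/\tau$, we have
$u_x < 1/(\ell\tau)$ and KeyBase(x) $< \tau^*$ and the count remains the
same. Otherwise, with probability $\bigl(1 - \frac{1}{\ell\tau}\bigr)\tau^*\ell = \tau^*(\ell - \tau^{-1})$,
we have $u_x > 1/(\ell\tau)$ and KeyBase(x) $< \tau^*$. In this
case we consider the density at $y$ of the sum of the previous $u$
and the new deduction $(y - u) \sim$ Exp$(1/\ell)$. We obtain

$$\frac{\tau^*}{\tau}\,\tau e^{-\tau y} + \tau^*(\ell - \tau^{-1})\int_0^y \tau e^{-\tau u}\,\frac{1}{\ell} e^{-(y-u)/\ell}\,du$$
$$= \tau^* e^{-\tau y} + \tau^*(\tau - \ell^{-1})\,e^{-y/\ell}\int_0^y e^{-u(\tau - 1/\ell)}\,du = \tau^* e^{-y/\ell} .$$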

Last, we consider the case $\tau\ell < 1$. With probability
$\tau^*/\tau$ the key maintains the same count, since conditioned
on KeyBase(x) $< \tau$, we have KeyBase(x) $< \tau^*$ with
probability $\tau^*/\tau$. Otherwise, the count is 0. So we obtain the
density

$$\frac{\tau^*}{\tau}\,\tau e^{-y/\ell} = \tau^* e^{-y/\ell} .$$

□
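The three eviction cases of the proof can be checked numerically. The following is a minimal sketch, assuming the density form $\phi(y) = \tau \exp(-y \max\{1/\ell, \tau\})$ from the proof; the function names (`integrate`, `phi`, `updated_density`) and the midpoint-rule integration are illustrative choices, not part of the original algorithm.

```python
import math

def integrate(f, a, b, n=4000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def phi(y, tau, ell):
    """Claimed density of the uncounted weight y under threshold tau."""
    return tau * math.exp(-y * max(1.0 / ell, tau))

def updated_density(y, tau, tau_new, ell):
    """Density of y after an eviction step lowers tau to tau_new < tau."""
    keep = (tau_new / tau) * phi(y, tau, ell)
    if tau * ell < 1:
        # Case tau*ell < 1: the count is simply kept with probability tau_new/tau.
        return keep
    if tau_new * ell >= 1:
        # Case tau*ell >= 1 and tau_new*ell >= 1: extra deduction ~ Exp(tau_new).
        p, rate = 1.0 - tau_new / tau, tau_new
    else:
        # Case tau*ell >= 1 and tau_new*ell < 1: extra deduction ~ Exp(1/ell).
        p, rate = tau_new * (ell - 1.0 / tau), 1.0 / ell
    conv = integrate(
        lambda u: phi(u, tau, ell) * rate * math.exp(-rate * (y - u)), 0.0, y)
    return keep + p * conv

if __name__ == "__main__":
    for tau, tau_new, ell in [(2.0, 1.5, 1.0), (2.0, 0.25, 1.0), (0.5, 0.2, 1.0)]:
        for y in (0.1, 1.0, 3.0):
            print(tau, tau_new, ell, y,
                  updated_density(y, tau, tau_new, ell), phi(y, tau_new, ell))
```

In each case the two printed values should agree up to integration error, which is the invariant the proof establishes.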
Appendix B

ppswor Variance Analysis


In our variance analysis, we make use of the following notion
of domination of a distribution by another distribution (or
function): a distribution on $y \ge 0$ with density function $b$
is dominated by a function $s$ if

$$\forall z \ge 0, \quad \int_0^z b(y)\,dy \le \int_0^z s(y)\,dy .$$

We will use domination to bound variance. We will compute
the variance $v(y)$ conditioned on a threshold value $y$ and then
compute the unconditioned variance as the expectation of $v(y)$
over the distribution of the threshold. We will use a dominating
distribution to bound the variance using the general property:

Lemma B.1. If $v(y)$ is non-increasing in $y$ and $b$ is dominated
by $s$ then $\int_0^\infty b(y)v(y)\,dy \le \int_0^\infty s(y)v(y)\,dy$.

We are now ready to upper bound the CV for ppswor. Our
bounds for other sampling schemes build on this ppswor proof.
We start with some basic lemmas. We use the notation $s_{W,k}$ and
$S_{W,k}$ respectively for the density and cumulative distribution
functions of the Erlang distribution Erlang$(W, k)$, which is a
sum of $k$ independent exponential distributions with parameter
$W$.



Lemma B.2. Let $B$ be a distribution which can be expressed
as the sum of $k$ independent exponential distributions, each
with parameter that is at most $W$ (the set of parameters can be
a random variable and parameters may not be independent).
Then $B$ is dominated by Erlang$(W, k)$.


Consider now ppswor sampling of $X$ with respect to weights
$w_x$ and let $W = \sum_z w_z$. For a particular key $x$, let $B_x$ be the
distribution of the $k$th smallest seed value in $X \setminus x$.


Lemma B.3. For all keys $x$, the distribution $B_x$ is dominated
by Erlang$(W, k)$.


Proof: Let $\tau' \sim B_x$. By definition, $\tau'$ is the $k$th smallest
of independent exponential random variables with parameters
$w_y$ for $y \in X \setminus x$. From properties of the exponential
distribution, the minimum seed is exponentially distributed with
parameter $\sum_{y \in X \setminus x} w_y = W - w_x$. Conditioned on a particular
key $z_1$ having the smallest seed, the difference between the
minimum and second smallest is exponentially distributed with
parameter $W - w_x - w_1$, where $w_1$ is the weight of the key
$z_1$ with minimum seed, and so on. Therefore, the distribution
on $\tau'$ conditioned on the ordered set of smallest-seed elements
is a sum of $k$ exponential random variables with parameters
at most $W$. The distribution $B_x$ is a convex combination of
such distributions: one distribution for each possible ordered
subset of size $k$ of $X \setminus x$, and each such choice has probability
equal to the probability of the ordered subset being the first $k$
keys of ppswor sampling from $X \setminus x$.

Therefore, from Lemma B.2 the distribution $B_x$ (for any
$x$) is dominated by Erlang with parameters $(W, k)$.

□


Theorem B.1. Consider ppswor sampling with respect to
weights $f(w_x)$. For a segment $H$ with proportion
$q = Q(f,H)/Q(f,X)$, the CV of the estimate $\hat{Q}(f,H)$ is at
most $(q(k-1))^{-0.5}$.

Proof: We extend the analysis in [?], [?] (note that here
we take $k$ to be the sample size without the threshold (the $(k+1)$st
smallest seed), whereas in [?], [?] $k$ is larger by 1).

WLOG, since we are considering sampling aggregated data,
we assume $w_x = f(w_x)$. Let $W = \sum_x w_x$ be the total
weight of the population.

We first consider the variance of the inverse probability
estimate for a key with weight $w$ with respect to a fixed
threshold $\tau$. The variance is $(1/p - 1)w^2$, where $p = 1 - e^{-\tau w}$,
and is at most

$$\mathrm{Var}[\hat{w}_x \mid \tau] = w^2\,\frac{e^{-\tau w}}{1 - e^{-\tau w}} \le w/\tau ,$$

using the relation $e^{-x}/(1 - e^{-x}) \le 1/x$.

We now consider the "perspective" of a key $x$ and the
distribution $B_x$ of the $k$th smallest seed value in $X \setminus x$. From
Lemma B.3, $B_x$ is dominated by Erlang$(W, k)$.

We will bound the variance of the estimate using the relation

$$\mathrm{Var}[\hat{w}_x] = E_{\tau' \sim B_x} \mathrm{Var}[\hat{w}_x \mid \tau'] .$$

Since the conditioned variance $\mathrm{Var}[\hat{w}_x \mid \tau']$ is non-increasing
with $\tau'$, from Lemma B.1 domination implies that

$$E_{\tau' \sim B_x} \mathrm{Var}[\hat{w}_x \mid \tau'] \le E_{\tau' \sim s_{W,k}} \mathrm{Var}[\hat{w}_x \mid \tau'] .$$

Therefore it suffices to upper bound the expectation for $s_{W,k}$.
We now use the Erlang density function [?]

$$s_{W,k}(x) = \frac{W^k x^{k-1} e^{-Wx}}{(k-1)!}$$

and the relation $\int_0^\infty x^a e^{-bx}\,dx = a!/b^{a+1}$ to bound the
variance:

$$\mathrm{Var}[\hat{w}_x] \le \int_0^\infty s_{W,k}(z)\,\mathrm{Var}[\hat{w}_x \mid z]\,dz \le w \int_0^\infty \frac{W^k z^{k-1} e^{-Wz}}{(k-1)!}\,\frac{1}{z}\,dz = \frac{w W^k}{(k-1)!} \int_0^\infty z^{k-2} e^{-Wz}\,dz = \frac{wW}{k-1} .$$

Since covariances between different keys are zero [?], the
variance on a set $H$ with weight $w(H)$ is the sum of variances,
$\mathrm{Var}[\hat{w}(H)] \le w(H)W/(k-1)$. We divide by $w(H)^2$ and take
the square root to obtain an upper bound on the CV. □
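The key integral of this proof, the expectation of $w/z$ under the Erlang$(W,k)$ density, can be checked numerically. The sketch below is illustrative only: the function names and the midpoint-rule integration with a truncated upper limit are assumptions of this example, not part of the analysis.

```python
import math

def erlang_pdf(z, W, k):
    """Density of Erlang(W, k): W^k z^(k-1) e^(-W z) / (k-1)!, computed in log space."""
    return math.exp(k * math.log(W) + (k - 1) * math.log(z) - W * z - math.lgamma(k))

def expected_variance_bound(w, W, k, n=20000):
    """Midpoint-rule value of the integral of s_{W,k}(z) * (w / z) over z > 0.

    The upper limit is an ad hoc cutoff far in the Erlang tail.
    """
    upper = (k + 12.0 * math.sqrt(k) + 12.0) / W
    h = upper / n
    total = 0.0
    for i in range(n):
        z = (i + 0.5) * h
        total += erlang_pdf(z, W, k) * w / z
    return h * total

def ppswor_cv_bound(q, k):
    """CV upper bound (q (k - 1))^(-1/2) for a segment with proportion q."""
    return (q * (k - 1)) ** -0.5

if __name__ == "__main__":
    w, W, k = 3.0, 50.0, 10
    print(expected_variance_bound(w, W, k), w * W / (k - 1))
    print(ppswor_cv_bound(0.5, 21))
```

The first line printed compares the numerical integral with the closed form $wW/(k-1)$; the second evaluates the CV bound for a segment holding half of the total weight with $k = 21$.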


Appendix C 

SH SUM STATISTICS VARIANCE 

We now consider the fixed-$k$ SH estimator applied for sum
statistics $f(w) = w$. The estimate is $\hat{Q}(w, H) = \sum_{x \in H} (c_x + \tau^{-1})$,
where the sum ranges over the sampled keys in $H$ [?]. The estimator
has at least the variance of the ppswor
estimator, since it has the same distribution over keys, and
the ppswor inverse probability estimate, which cannot be
applied with SH, minimizes variance for this distribution (over
all unbiased estimators that can be expressed as a sum over
sampled keys of per-key unbiased estimates). Surprisingly, we
obtain the same upper bound on the CV:

Theorem C.1. The SH sum estimator on a segment $H$ with
proportion $q = w(H)/w(X)$ has CV of at most $(q(k-1))^{-0.5}$.

Proof: We bound the variance of the fixed-$k$ SH estimate
on a key $x$ conditioned on $\tau$. With probability $e^{-\tau w}$ the key is
not sampled, the estimate is 0, and the contribution to variance
is $w^2$. Otherwise, with density $\tau e^{-\tau y}$ the count is $c_x = w - y$,
the estimate is $c_x + \tau^{-1}$, and the contribution to the variance
is $(\tau^{-1} - y)^2$. We obtain

$$\mathrm{Var}[\hat{w}_x \mid \tau] = w^2 e^{-\tau w} + \int_0^w \tau e^{-\tau y}(\tau^{-1} - y)^2\,dy = \tau^{-2}(1 - e^{-\tau w}) \le w/\tau .$$

The last inequality uses the relation $1 - e^{-x} \le x$.

The distribution of $\tau'$, the $k$th smallest seed in $X \setminus x$, is the
same as in ppswor and we can conclude as in the proof of
Theorem B.1. We take the expectation of this variance over
the distribution $s_{W,k}$, which dominates $B_x$, and, using the zero
covariances, bound the variance on $w(H)$ by $w(H)W/(k-1)$. □
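The conditional variance used in this proof can be verified numerically against its closed form. This is a small sketch under the assumptions stated in the proof (not-sampled probability $e^{-\tau w}$, density $\tau e^{-\tau y}$ for the uncounted weight $y$); the function name and the midpoint rule are illustrative.

```python
import math

def sh_conditional_variance(w, tau, n=20000):
    """Variance of the SH sum estimate c_x + 1/tau for a key of weight w, given tau.

    Adds the not-sampled term w^2 * exp(-tau*w) to a midpoint-rule value of
    the integral of tau * exp(-tau*y) * (1/tau - y)^2 over y in [0, w].
    """
    h = w / n
    integral = 0.0
    for i in range(n):
        y = (i + 0.5) * h
        integral += tau * math.exp(-tau * y) * (1.0 / tau - y) ** 2
    return w * w * math.exp(-tau * w) + h * integral

if __name__ == "__main__":
    for w, tau in [(5.0, 0.3), (1.0, 2.0), (40.0, 0.05)]:
        closed_form = (1.0 - math.exp(-tau * w)) / tau ** 2
        print(w, tau, sh_conditional_variance(w, tau), closed_form, w / tau)
```

Each row lists the numerical variance, the closed form $\tau^{-2}(1 - e^{-\tau w})$, and the upper bound $w/\tau$.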

Appendix D 

SH$_\ell$ CAP STATISTICS VARIANCE

We bound the variance by relating the distribution of seed(x)
under fixed-$k$ SH$_\ell$ to the distribution with ppswor with respect
to key weights $\min\{w_x, T\}$, that is, using seed(x) $\sim$
Exp$[\min\{w_x, T\}]$. For the purpose of this analysis, we work
with an SH$_\ell$ seed distribution that is exponential when
conditioned on $y \le 1/\ell$ instead of uniform. This does not
change the algorithm or estimation, since there is a monotone
transformation that preserves seed order. The SH$_\ell$ density
function of the seed of a key with weight $w$ is

$$b_{\ell,w}(y) = \begin{cases} y \le 1/\ell : & \dfrac{1 - e^{-w/\ell}}{1 - 1/e}\,\ell e^{-\ell y} \\ y > 1/\ell : & w e^{-w y} \end{cases} \tag{14}$$

We now can relate $b_{\ell,w_x}(y)$ to Exp$[\min\{w_x, T\}]$:

Lemma D.1. For any key $x$ and cap value $T$, the density
function $b_{\ell,w_x}(y)$ of seed(x) under fixed-$k$ SH$_\ell$ is dominated
by

$$\frac{e}{e-1}\,\max\{1, \ell/T\}\,\min\{w_x, T\}\,e^{-\min\{w_x, T\}\,y} .$$

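Proof: We show that for any point $z \ge 0$,

$$\int_0^z b_{\ell,w}(y)\,dy \le \int_0^z \frac{e}{e-1}\max\{1, \tfrac{\ell}{T}\}\min\{w, T\}\,e^{-\min\{w,T\}y}\,dy = \frac{e}{e-1}\max\{1, \tfrac{\ell}{T}\}\bigl(1 - e^{-\min\{w,T\}z}\bigr) . \tag{15}$$

The proof is via case analysis. We start with $z \le 1/\ell$. We
have that

$$\int_0^z b_{\ell,w}(y)\,dy = \int_0^z \frac{1 - e^{-w/\ell}}{1 - e^{-1}}\,\ell e^{-\ell y}\,dy = \frac{(1 - e^{-w/\ell})(1 - e^{-\ell z})}{1 - e^{-1}} .$$

Therefore, to establish the claim we need to show that

$$(1 - e^{-\ell z})(1 - e^{-w/\ell}) \le \max\{1, \tfrac{\ell}{T}\}\bigl(1 - e^{-\min\{w,T\}z}\bigr) . \tag{16}$$

• Case $w \le T$:

$$(1 - e^{-\ell z})(1 - e^{-w/\ell}) \le (1 - e^{-wz}) = (1 - e^{-\min\{w,T\}z}) ,$$

using the relation

$$\forall a, b \ge 0, \quad (1 - e^{-a})(1 - e^{-b}) \le 1 - e^{-ab} , \tag{17}$$

and thus (16) holds.

• Case $w \ge T$ and $\ell \ge T$: We have

$$(1 - e^{-\ell z})(1 - e^{-w/\ell}) \le 1 - e^{-\ell z} \le \frac{\ell}{T}(1 - e^{-Tz}) = \frac{\ell}{T}\bigl(1 - e^{-\min\{w,T\}z}\bigr) .$$

The last inequality follows from the function $(1 - \exp(-x))/x$
being monotone decreasing. Therefore $\ell \ge T$
implies $\ell z \ge Tz$ and thus
$\frac{1 - e^{-\ell z}}{\ell z} \le \frac{1 - e^{-Tz}}{Tz}$, which
implies $1 - e^{-\ell z} \le \frac{\ell}{T}(1 - e^{-Tz})$. We therefore obtain
that (16) holds.

• Case $w \ge T$ and $\ell \le T$: We have

$$(1 - e^{-\ell z})(1 - e^{-w/\ell}) \le (1 - e^{-Tz}) = (1 - e^{-\min\{w,T\}z})$$

and therefore (16) holds.

We now consider $z \ge 1/\ell$. We have

$$\int_0^{1/\ell} b_{\ell,w}(y)\,dy = 1 - e^{-w/\ell} , \qquad \int_{1/\ell}^z b_{\ell,w}(y)\,dy = e^{-w/\ell} - e^{-wz} .$$

Thus, $\int_0^z b_{\ell,w}(y)\,dy = 1 - e^{-wz}$. To verify (15), we need to
show that for all $z \ge 1/\ell$,

$$1 - e^{-wz} \le \frac{e}{e-1}\max\{1, \tfrac{\ell}{T}\}\bigl(1 - e^{-\min\{w,T\}z}\bigr) . \tag{18}$$

This is immediate for $w \le T$. We now consider $w \ge T$. Since
$1 - e^{-wz} \le 1$, it suffices to show that

$$\frac{e}{e-1}\max\{1, \tfrac{\ell}{T}\}(1 - e^{-Tz}) \ge 1 .$$

Using $z \ge 1/\ell$, it suffices to show

$$\frac{e}{e-1}\max\{1, \tfrac{\ell}{T}\}(1 - e^{-T/\ell}) \ge 1 . \tag{19}$$

If $T \ge \ell$ then we substitute $1 - e^{-T/\ell} \ge 1 - e^{-1}$ to
show that (19) holds. If $T \le \ell$ we have $T/\ell \le 1$ and use the
inequality $(1 - e^{-a}) \ge a(1 - e^{-1})$ for $a \le 1$ to obtain

$$\frac{e}{e-1}\,\frac{\ell}{T}\,(1 - e^{-T/\ell}) \ge \frac{e}{e-1}\,\frac{\ell}{T}\,\frac{T}{\ell}\,(1 - e^{-1}) = 1 .$$

□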
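The domination claim of Lemma D.1 can be spot-checked on a grid of parameters. The following is a hedged sketch: `seed_cdf` integrates the seed density (14) in closed form, `dominating_cdf` is the right-hand side of (15), and the grid values and function names are arbitrary choices made for this example.

```python
import math

def seed_cdf(z, ell, w):
    """CDF at z of the SH_ell seed with density (14), for a key of weight w."""
    if z <= 1.0 / ell:
        return (1 - math.exp(-w / ell)) * (1 - math.exp(-ell * z)) / (1 - math.exp(-1))
    return 1 - math.exp(-w * z)

def dominating_cdf(z, ell, w, T):
    """Integral up to z of the dominating function in Lemma D.1."""
    m = min(w, T)
    return math.e / (math.e - 1) * max(1.0, ell / T) * (1 - math.exp(-m * z))

def max_violation(ells, weights, caps, zs):
    """Largest value of seed_cdf - dominating_cdf over the grid (should be <= 0)."""
    worst = float("-inf")
    for ell in ells:
        for w in weights:
            for T in caps:
                for z in zs:
                    gap = seed_cdf(z, ell, w) - dominating_cdf(z, ell, w, T)
                    worst = max(worst, gap)
    return worst

if __name__ == "__main__":
    zs = [0.001 * 2 ** i for i in range(16)]
    print(max_violation([0.5, 1, 4, 20], [0.2, 1, 3, 10, 50], [0.5, 1, 5, 30], zs))
```

A non-positive result (up to floating-point rounding) is consistent with inequality (15) on the tested grid; it is a sanity check, not a substitute for the case analysis.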

We are now able to express, for a key $x$, a dominating
distribution to the distribution of the $k$th smallest seed in $X \setminus x$
when using fixed-$k$ SH$_\ell$.


















Lemma D.2. The distribution $B$ of the $k$th smallest seed,
where seeds for $z \in X \setminus x$ are independently drawn from
$b_{\ell,w_z}$, is dominated by the function

$$\frac{e}{e-1}\,\max\{1, \tfrac{\ell}{T}\}\,s_{W,k} ,$$

where

$$W = \sum_{z \in X} \min\{w_z, T\}$$

and $s_{W,k}$ is the density of Erlang$(W, k)$.

Proof: From Lemma D.1, $B$ is dominated by
$\frac{e}{e-1}\max\{1, \ell/T\}$ times the density of the $k$th seed according
to Exp$[\min\{w_z, T\}]$. The latter distribution is dominated by
$s_{W,k}$. We get the claim from transitivity of domination. □


D.1 CV bound for 2-pass estimator 

We are now ready to bound the variance of the 2-pass fixed-$k$
SH$_\ell$ estimator. Recall that the estimator is applied to an SH$_\ell$
sample which includes the exact $\mathrm{cap}_T(w_x)$ values of sampled
keys.

We first bound the variance for a fixed $\tau$.


Lemma D.3.

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w_x) \mid \tau] \le \max\{\tfrac{T}{\ell}, 1\}\cdot\frac{\min\{w_x, T\}}{\tau} .$$

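Proof: The inclusion probability of a key $x$ conditioned
on the threshold $\tau$ is obtained by integrating the seed density (14):

$$\Pr[\mathrm{seed}(x) < \tau] = \begin{cases} \tau\ell \ge 1 : & 1 - e^{-\tau w_x} \\ \tau\ell \le 1 : & \dfrac{(1 - e^{-\tau\ell})(1 - e^{-w_x/\ell})}{1 - e^{-1}} \end{cases} \tag{20}$$

The variance conditioned on $\tau$ of the inverse probability
estimate is

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w_x) \mid \tau] = \Bigl(\frac{1}{\Pr[\mathrm{seed}(x) < \tau]} - 1\Bigr)\min\{w_x, T\}^2 . \tag{21}$$

For $\tau\ell \ge 1$ we have

$$\Bigl(\frac{1}{\Pr[\mathrm{seed}(x) < \tau]} - 1\Bigr) = \frac{e^{-\tau w_x}}{1 - e^{-\tau w_x}} \le \frac{1}{\tau w_x} .$$

The last inequality uses the relation $e^{x} \ge 1 + x$ for $x \ge 0$.
Substituting in (21) we obtain

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w_x) \mid \tau] \le \min\{w_x, T\}^2\,\frac{1}{\tau w_x} \le \frac{\min\{w_x, T\}}{\tau} .$$

It remains to treat the case $\tau\ell \le 1$. We will establish that

$$\frac{1 - e^{-1}}{(1 - e^{-\tau\ell})(1 - e^{-w/\ell})} - 1 \le \frac{1}{\tau \min\{w, \ell\}} . \tag{22}$$

We first consider $w \ge \ell$. In this case the right hand side is
fixed at $1/(\tau\ell)$. To maximize the left hand side, over $w \ge \ell$,
we take $w = \ell$. We then obtain that the left hand side is at
most $\frac{e^{-\tau\ell}}{1 - e^{-\tau\ell}} \le \frac{1}{\tau\ell}$, which establishes (22).

We next consider $w \le \ell$, recalling that we already assume
$\tau\ell \le 1$, and thus have $w \le \ell \le 1/\tau$. To maximize the left hand
side of (22) under these assumptions we need to minimize
the denominator $h(\ell) = (1 - e^{-\tau\ell})(1 - e^{-w/\ell})$ in the range
$w \le \ell \le 1/\tau$. By examining the derivative of $h$ with respect to $\ell$, we see that
over this range $h(\ell)$ is minimized for
$\ell = 1/\tau$. Substituting $\ell = 1/\tau$, we obtain that the left hand
side of (22) is at most $\frac{e^{-\tau w}}{1 - e^{-\tau w}} \le \frac{1}{\tau w}$, and thus (22) is fully
established.

We now note that the left hand side of (22) is equal to
$\frac{1}{\Pr[\mathrm{seed}(x) < \tau]} - 1$. Substituting in (21), we obtain that

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w) \mid \tau] = \Bigl(\frac{1}{\Pr[\mathrm{seed}(x) < \tau]} - 1\Bigr)\min\{w, T\}^2 \le \frac{\min\{w, T\}^2}{\tau \min\{w, \ell\}} \le \max\{\tfrac{T}{\ell}, 1\}\,\frac{\min\{w, T\}}{\tau} .$$

The last inequality uses $\min\{w,T\}/\min\{w,\ell\} \le 1$ when $w \le \ell$ and
$\min\{w,T\}/\min\{w,\ell\} \le T/\ell$ when $w \ge \ell$.

□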

We are now ready to conclude the proof of the CV bound for the 2-pass estimator.
We bound the variance with respect to the distribution $B$ using
the dominating function as in Lemma D.2. We obtain that the
variance for key $x$ is

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w_x)] \le \frac{e}{e-1}\,\max\{\tfrac{T}{\ell}, \tfrac{\ell}{T}\}\,W\,\frac{\min\{w_x, T\}}{k-1} .$$

We conclude as in the proof of Theorem B.1, showing
that our estimate of $Q(\mathrm{cap}_T, H)$ for a segment $H$ with
proportion $q$ of the $\mathrm{cap}_T$ statistics has CV that is at most

$$\Bigl(\frac{e}{e-1}\cdot\frac{\max\{T/\ell, \ell/T\}}{q(k-1)}\Bigr)^{0.5} .$$
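To get a feel for the bound, it can be evaluated for concrete parameters. The helper below is a minimal sketch; the function name `two_pass_cv_bound` and the example values are illustrative assumptions, and the formula is the CV bound stated just above.

```python
import math

def two_pass_cv_bound(ell, T, q, k):
    """Upper bound on the CV of the 2-pass estimate of a cap_T statistic.

    ell: cap parameter of the SH_ell sample, T: cap of the queried statistic,
    q: proportion of the statistic carried by the segment, k: sample size.
    """
    disparity = max(T / ell, ell / T)
    return math.sqrt(math.e / (math.e - 1) * disparity / (q * (k - 1)))

if __name__ == "__main__":
    # Matching sample and query caps: only the e/(e-1) factor is paid.
    print(two_pass_cv_bound(ell=10, T=10, q=0.1, k=1001))
    # A tenfold mismatch between ell and T inflates the bound by sqrt(10).
    print(two_pass_cv_bound(ell=1, T=10, q=0.1, k=1001))
```

With $\ell = T$, $q = 0.1$ and $k = 1001$ the bound is roughly 0.126; the disparity factor $\max\{T/\ell, \ell/T\}$ enters under the square root, so a mismatch by a factor of 10 costs a factor of about 3.16 in the bound.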


D.2 1-pass variance bound 

We provide the proof of Theorem 5.4, which bounds the
variance of the 1-pass estimators.

The estimators are applied to the same sample distribution
of included keys, and the proof outline is similar to that
of the 2-pass estimator (Section D.1). The only component
we are missing is a bound on the conditional variance
$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w_x) \mid \tau]$: since the exact weight $w_x$ is not available,
we cannot apply Lemma D.3 and instead compute the variance
of the 1-pass estimator that is applied to $c_x$.


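Lemma D.4.

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w_x) \mid \tau] \le \frac{\min\{w_x, T\}}{\tau}\Bigl(\frac{\ell}{T}(1 - e^{-T/\ell}) + \frac{T}{\ell}\Bigr) \le \Bigl(1 + \frac{T}{\ell}\Bigr)\frac{\min\{w_x, T\}}{\tau} .$$

Proof:

The estimation coefficients $\beta(c)$ are provided in Theorem
5.3. We bound

$$\mathrm{Var}[\widehat{\mathrm{cap}}_T(w) \mid \tau] = E[\beta(c)^2] - E[\beta(c)]^2 = E[\beta(c)^2] - \min\{T, w\}^2 .$$

The last equality follows from unbiasedness, $E[\beta(c)] = \mathrm{cap}_T(w)$.

For $\mathrm{cap}_T$, we have $f'(x) = 1$ for $x \le T$ and $f'(x) = 0$
otherwise. Therefore,

$$\beta(c) = \frac{\min\{T, c\}}{\min\{1, \ell\tau\}} + \tau^{-1} I_{c \le T} .$$

We use the density function of $c$ given $w$, which is provided
in Theorem 5.2. The density function for $w \ge c > 0$ is
$\tau \exp(-(w - c)\max\{1/\ell, \tau\})$, and we
have that $\beta(c) = 0$ when $c = 0$. We now use case analysis.

We first consider the case $\tau\ell \ge 1$ and $w \le T$. We have
$\beta(c) = c + 1/\tau$ and

$$E[\beta^2] - w^2 = -w^2 + \int_0^w \tau e^{-\tau x}\beta(w - x)^2\,dx = -w^2 + \int_0^w \tau e^{-\tau x}(\tau^{-1} + w - x)^2\,dx = \tau^{-2}(1 - e^{-\tau w}) \le w/\tau .$$

The last inequality uses the relation $(1 - e^{-x}) \le x$.

For $\tau\ell \ge 1$ and $w \ge T$, we have $\beta(c) = T$ for $w \ge c > T$
and $\beta(c) = c + 1/\tau$ for $c \le T$. Therefore,

$$E[\beta^2] - T^2 = -T^2 + \int_0^{w-T} \tau e^{-\tau x} T^2\,dx + \int_{w-T}^{w} \tau e^{-\tau x}(\tau^{-1} + w - x)^2\,dx$$
$$= -T^2 + (1 - e^{-\tau(w-T)})T^2 + e^{-\tau(w-T)}(T^2 + \tau^{-2}) - \tau^{-2}e^{-\tau w}$$
$$= \tau^{-2}e^{-\tau w}(e^{\tau T} - 1) \le \tau^{-2}(1 - e^{-\tau T}) \le T/\tau ,$$

using the fact that $e^{-\tau w}$ is maximized (subject to $w \ge T$)
when $w = T$, and $(1 - e^{-x}) \le x$.

For $\tau\ell \le 1$ and $w \le T$, we have $\beta(c) = \tau^{-1}(c/\ell + 1)$ and

$$E[\beta^2] = \int_0^w \tau e^{-x/\ell}\,\tau^{-2}\Bigl(\frac{w - x}{\ell} + 1\Bigr)^2 dx = \frac{w}{\tau}\Bigl(\frac{\ell}{w}(1 - e^{-w/\ell}) + \frac{w}{\ell}\Bigr) \le \frac{w}{\tau}\Bigl(\frac{\ell}{T}(1 - e^{-T/\ell}) + \frac{T}{\ell}\Bigr) \le \frac{w}{\tau}\Bigl(1 + \frac{T}{\ell}\Bigr) .$$

The second to last inequality follows from the function
$(1 - e^{-x})/x + x$ being increasing in the range $x \ge 0$. Therefore,
subject to fixed $\ell$ and $w \le T$, the expression using $x = w/\ell$
is maximized at $w = T$.

For $\tau\ell \le 1$ and $w \ge T$:

$$E[\beta^2] = \int_0^{w-T} \tau e^{-x/\ell}\Bigl(\frac{T}{\tau\ell}\Bigr)^2 dx + \int_{w-T}^{w} \tau e^{-x/\ell}\Bigl(\tau^{-1} + \frac{w - x}{\tau\ell}\Bigr)^2 dx$$
$$= (\tau\ell)^{-1}T^2(1 - e^{-(w-T)/\ell}) + (\tau\ell)^{-1}\int_{w-T}^{w} (1/\ell)e^{-x/\ell}(\ell + w - x)^2\,dx$$
$$= (\tau\ell)^{-1}\Bigl(T^2(1 - e^{-(w-T)/\ell}) + e^{-(w-T)/\ell}(\ell^2 + T^2) - \ell^2 e^{-w/\ell}\Bigr)$$
$$= \frac{1}{\tau}\Bigl(\frac{T^2}{\ell} + \ell e^{-w/\ell}(e^{T/\ell} - 1)\Bigr) \le \frac{1}{\tau}\Bigl(\frac{T^2}{\ell} + \ell(1 - e^{-T/\ell})\Bigr) \le \frac{T}{\tau}\Bigl(1 + \frac{T}{\ell}\Bigr) .$$

The second to last derivation substitutes $w = T$ for the $w$
value that maximizes the expression subject to $w \ge T$. We
then use $(1 - e^{-x}) \le x$. □


Appendix E

Discretized threshold sampling

A variation on the fixed-$k$ discrete SH$_\ell$ sampling scheme that
can be useful in practice is to limit the algorithm to work with
a discrete set of thresholds (think $\tau = a^i$ for some $a < 1$) (see
Algorithm 6). When the cache is full, the threshold is adjusted
in iterations until its size drops below $k$. Discretized thresholds
with fixed-$k$ SH were considered in [9]. Discretized thresholds
have the advantage that the number of times keys are pulled
out of and placed back on the priority queue for updates is lower.
Another advantage is that the estimators, when expressed as
coefficients which depend on $\ell$ and $\tau$, can be reused with
different samples.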
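The discretized-threshold scheme of Algorithm 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `element_score` is a hypothetical stand-in for ElementScore (a SHA-256 hash of the key and a uniformly drawn bucket in 1..`ell`, mapped to [0, 1)), every stream element has unit weight, and the names `sample_stream`, `ell`, and `rng` are introduced only for this example.

```python
import hashlib
import random

def element_score(key, ell, rng):
    """Hypothetical ElementScore: hash of (key, random bucket in 1..ell) in [0, 1)."""
    bucket = rng.randint(1, ell)
    digest = hashlib.sha256(f"{key}|{bucket}".encode()).digest()
    return int.from_bytes(digest[:6], "big") / 2 ** 48

def sample_stream(stream, k, a, ell, rng):
    """Keep at most k counted keys; thresholds are restricted to powers of a < 1."""
    counters, seed, tau = {}, {}, 1.0
    for key in stream:
        if key in counters:
            counters[key] += 1
            continue
        score = element_score(key, ell, rng)
        if score >= tau:
            continue  # element not sampled under the current threshold
        counters[key], seed[key] = 1, score
        while len(counters) > k:
            tau *= a  # move to the next discrete threshold
            while counters:
                y = max(counters, key=seed.__getitem__)
                if seed[y] <= tau:
                    break  # every cached seed is below the new threshold
                # Re-score counted elements of y one at a time under the new tau.
                while counters[y] > 0 and seed[y] > tau:
                    counters[y] -= 1
                    seed[y] = element_score(y, ell, rng)
                if counters[y] == 0:
                    del counters[y], seed[y]
    return tau, counters

if __name__ == "__main__":
    rng = random.Random(7)
    freq = {f"key{i}": 1 + 200 // (i + 1) for i in range(60)}
    stream = [x for x, c in freq.items() for _ in range(c)]
    rng.shuffle(stream)
    tau, sample = sample_stream(stream, k=10, a=0.5, ell=5, rng=rng)
    print(tau, sample)
```

Keeping the threshold on the grid $a^i$ means that an eviction round touches only keys whose seed exceeds the new threshold, which is the practical advantage described above.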


Algorithm 6: stream sampling: max size k and discretized thresholds

Data: sample size k, a < 1, stream of elements from X
Output: set of k pairs (x, c_x) where x ∈ X and c_x ∈ [1, w_x]

Counters ← ∅ // initialization
τ ← 1 // Sampling threshold
// Processing a stream element of key x
foreach stream element h with key x do
    if x is in Counters then
        Counters[x] ← Counters[x] + 1
    else
        seed(x) ← ElementScore(h)
        if seed(x) < τ then
            Counters[x] ← 1
            while |Counters| > k do
                τ ← aτ
                while max{seed(x) | x in Counters} > τ do
                    y ← argmax{seed(x) | x in Counters}
                    while Counters[y] > 0 and seed(y) > τ do
                        Counters[y] ← Counters[y] − 1
                        seed(y) ← ElementScore(y)
                    if Counters[y] == 0 then
                        delete Counters[y], seed(y)
return (τ; (x, Counters[x]) for x in Counters)


Appendix F

Approximate cap$_T$ Counters

Our design computes a sample of the active keys, over which
segment statistics can be estimated. One can also consider the
more basic problem of only estimating the statistics over the
full data set, $Q(f, X)$.

The case $\mathrm{cap}_1$ corresponds to an (approximate) distinct count
of keys, which is a fundamental and well-studied problem. The
case $\mathrm{cap}_\infty$ is the total weight of the full stream and can easily
be computed.

Our constructions can be modified to be more efficient when
we are only interested in estimating $Q(\mathrm{cap}_T, X)$: instead
of storing full identifiers of cached keys, which are needed
for a sample, we can hash the key domain to a domain of
size that is polynomial in the number $n$ of distinct keys.
The resulting sketch size in this case would be $O(\log n)$ to
represent each key hash and the count (which we can cap
by a polynomial in $T$, to ensure the representation of the
counts is at most $O(\log T)$). The result is an approximate
$\mathrm{cap}_T$ counter on streams that has state (structure size) that
is $O(\epsilon^{-2}(\log T + \log n))$ and provides estimates with CV of $\epsilon$.

State-of-the-art approximate distinct counters, however, have
a smaller, double logarithmic dependence on $n$ [?], [?].
We present here a lightweight algorithm that provides a
rough approximation of $Q(\mathrm{cap}_T, X)$ with double logarithmic
state. We apply to each stream element the string returned by
the element scoring function ElementScore(h) used
in our discrete SH$_T$ algorithm. We then apply any off-the-shelf
approximate distinct counter [?], [?], [22], [6] to
the stream of ElementScore(h). Recall that the elements
being counted are the identifiers of key-bucket pairs from the
original stream, where a bucket $b \sim U[1, \ldots, T]$ is drawn
independently for each stream appearance of the key.

We now analyse the quality of this approximation.

Lemma F.1. The expected number of distinct strings generated
is between $(1 - 1/e)\,Q(\mathrm{cap}_T, X)$ and $Q(\mathrm{cap}_T, X)$.

Proof: The expected number of distinct strings that are
generated for a key of cardinality $w$ is $T(1 - (1 - 1/T)^w)$. This
is because the probability that we do not hit a certain bucket
with $w$ elements is $(1 - 1/T)^w$. Thus, the expected number
of empty buckets is $T(1 - 1/T)^w$.

So in expectation, a distinct counter applied with $T$ buckets
would produce an underestimate. The worst relative error is
obtained for keys with $w = T$, where the expected count
is about $T(1 - 1/e)$, thus the relative error is about $1/e$. However, the
error depends on the distribution of key sizes, and is small for
cardinalities much larger or smaller than $T$. □

This approach can obtain a rough estimate of $Q(\mathrm{cap}_T, X)$
to within $(1 - 1/e, 1)$ using state of size $O(\log\log n)$. (Since
there is inherent error of $1 - 1/e$, there is no point in using a
more accurate distinct counter.)

One approach to reduce the error, left for future work, is to
apply the counting with multiple values of $T$.
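The expectation used in the proof of Lemma F.1 is easy to tabulate. The sketch below evaluates $T(1 - (1 - 1/T)^w)$ for a single key and compares it with $\mathrm{cap}_T(w) = \min\{w, T\}$; the function names are illustrative, and the check covers only the per-key expectation, not the error of the distinct counter itself.

```python
import math

def expected_distinct_strings(w, T):
    """Expected number of distinct (key, bucket) strings for a key seen w times."""
    return T * (1.0 - (1.0 - 1.0 / T) ** w)

def cap(w, T):
    """The cap statistic cap_T(w) = min(w, T)."""
    return min(w, T)

def worst_ratio(T, max_w):
    """Smallest ratio expected/cap over cardinalities 1..max_w."""
    return min(expected_distinct_strings(w, T) / cap(w, T) for w in range(1, max_w + 1))

if __name__ == "__main__":
    for T in (1, 2, 5, 100):
        print(T, worst_ratio(T, 1000), 1 - 1 / math.e)
```

For each $T$ the smallest ratio is attained at $w = T$ and stays above $1 - 1/e$, in line with the lemma; for $w$ much smaller or much larger than $T$ the ratio is close to 1.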











